The novel coronavirus disease 2019 (COVID-19), first noted in Wuhan at the end of 2019, has been spreading rapidly worldwide. As an infectious disease, COVID-19 is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and presents with symptoms including fever, dry cough, shortness of breath, and tiredness. As of April 9, 2020, over 1.5 million people around the world had been confirmed as infected with COVID-19, with a case fatality rate of about 5.7% according to World Health Organization statistics (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports).
So far, no specific treatment has proven effective for COVID-19. Therefore, accurate and rapid testing is crucial for timely prevention of COVID-19 spread. Real-time reverse transcriptase polymerase chain reaction (RT-PCR) is regarded as the standard approach for testing COVID-19. However, RT-PCR testing is time-consuming and limited by the short supply of test kits [23, 19]. Moreover, RT-PCR has been reported to suffer from low sensitivity, and repeated testing is typically needed to confirm a COVID-19 case accurately. As a result, many patients are not confirmed in time [17, 1], which creates a high risk of infecting a larger population.
In recent years, imaging technology has emerged as a promising tool for automatic quantification and diagnosis of various diseases. As a routine diagnostic tool for pneumonia, chest computed tomography (CT) imaging has been strongly recommended in suspected COVID-19 cases for both initial evaluation and follow-up. Chest CT scans were found very useful in detecting typical radiographic features of COVID-19. A systematic review [22] concluded that chest CT imaging was sensitive for detecting COVID-19 even before some clinical symptoms were observed. Specifically, imaging features including ground-glass opacification, consolidation, and pleural effusion have been frequently observed in chest CT images of COVID-19 patients [11, 30, 25].
Accurate segmentation of these important radiographic features is crucial for a reliable quantification of COVID-19 infection in chest CT images. Segmentation of medical images requires manual annotation by well-trained expert radiologists. The rapidly increasing number of infected patients has placed a tremendous burden on radiologists and slowed down the labeling of ground-truth masks. Thus, there is an urgent need for automated segmentation of infection regions, which is a basic but arduous task in the pipeline of computer-aided disease diagnosis. However, automatically delineating the infection regions in chest CT scans is considerably challenging because of the large variation in both position and shape across patients and the low contrast of the infection regions in CT images.
Machine learning-based artificial intelligence provides a powerful technique for designing data-driven methods in medical image analysis. Advanced deep learning models can bring unique benefits to the rapid and automated segmentation of medical images. So far, fully convolutional networks have proven superior to the widely used registration-based approaches for segmentation. In particular, U-Net models work reasonably well for most segmentation tasks in medical images [21, 3, 2, 23]. However, several limitations of U-Net have not yet been effectively addressed. For example, the U-Net model struggles to capture the complex features required by tasks such as multi-class image segmentation and to recover these features in the segmentation image. There are also a few successful applications that adopt U-Net or its variants for CT image segmentation, including heart segmentation [32], liver segmentation [16], and multi-organ segmentation [7]. However, segmentation of COVID-19 infection regions with deep learning remains underexplored. COVID-19 is a new disease, yet it closely resembles common pneumonia in medical images, which makes its accurate quantification considerably challenging. Recent advances in deep learning offer many insightful ideas for improving the U-Net architecture. The most popular one is the deep residual network (ResNet) [10]. ResNet provides an elegant way to stack CNN layers and has demonstrated its strength when combined with U-Net. In addition, attention has been applied to U-Net and other deep learning models to boost their performance [20, 5].
Accordingly, we propose a novel deep learning model for rapid and accurate segmentation of COVID-19 infection regions in chest CT scans. Our model is based on the U-Net architecture and inspired by recent advances in the deep learning field. We exploit both the residual network and the attention mechanism to improve the efficacy of the U-Net. Experimental analysis is conducted with a public CT image dataset collected from patients infected with COVID-19 to assess the efficacy of the developed model. The results demonstrate that our study provides a promising segmentation tool for the timely and reliable quantification of lung infection, toward developing an effective pipeline for precise COVID-19 diagnosis.
The rest of the paper is organized as follows. We first review related work on existing deep learning methods for CT image segmentation in Section II Related Work. Our proposed deep learning model is described in detail in Section III Methodology, including the U-Net structure and the methods used to improve the encoder and decoder. The experimental study and performance assessment are described in Section IV, followed by a discussion and summary of our study.
This section introduces our proposed Residual Attention U-Net for lung CT image segmentation in detail. We start by describing the overall structure of the developed deep learning model, then explain the two improved components, the aggregated residual block and the locality sensitive hashing attention, as well as the training strategy. The overall flowchart is illustrated in Fig. 1.
The traditional U-Net is a type of artificial neural network (ANN) containing a set of convolutional layers and deconvolutional layers to perform the task of biomedical image segmentation. The structure of U-Net is symmetric, with two parts: an encoder and a decoder. The encoder is designed to extract the spatial features from the original medical image. The decoder constructs the segmentation map from the extracted spatial features. The encoder follows a style similar to FCN [18], combining several convolutional layers. Specifically, the encoder consists of a sequence of down-sampling blocks, with each block including two 3×3 convolution layers followed by a max-pooling layer with a stride of 2. The number of filters in the convolutional layers is doubled after each down-sampling operation. At the end, the encoder adopts two convolutional layers as the bridge to connect with the decoder.
In contrast, the decoder is designed for up-sampling and constructing the segmentation image. The decoder first uses a deconvolutional layer to up-sample the feature map generated by the encoder. The deconvolutional layer developed by Zeiler et al. [33] contains the transposed convolution operation and halves the number of filters in the output. It is followed by a sequence of up-sampling blocks, each of which consists of two convolution layers and a deconvolutional layer. Then, a 1×1 convolutional layer is used as the final layer to generate the segmentation result. The final layer adopts the Sigmoid function as the activation function, while all other layers use the ReLU function. The ReLU and the Sigmoid functions are defined as:

$$\mathrm{ReLU}(x) = \max(0, x), \qquad \mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
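As a small illustration, the two activations can be sketched in plain Python using their standard definitions. The function names are chosen here for illustration, and the sigmoid is written in a numerically stable form so that large negative inputs do not overflow.

```python
import math

def relu(x: float) -> float:
    """Rectified linear unit: keeps positive inputs and zeroes out the rest."""
    return max(0.0, x)

def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real input into the interval [0, 1]."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # avoids overflow of exp(-x) for very negative x
    return z / (1.0 + z)

print(relu(-2.0), relu(3.5))  # 0.0 3.5
print(sigmoid(0.0))           # 0.5
```

The sigmoid on the final layer yields a per-pixel score between 0 and 1, while ReLU keeps the intermediate feature maps non-negative.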
In addition, the U-Net concatenates part of the encoder features with the decoder. For each block in the encoder, the result of the convolution before the max-pooling is transferred symmetrically to the decoder. In the decoder, each block receives the feature representation learned by the encoder and concatenates it with the output of the deconvolutional layer. The concatenated result is then propagated forward to the next block. This concatenation helps the decoder recover features that may have been lost in the max-pooling.
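The bookkeeping described above (filters doubling on the way down, halving on the way up, and channels adding up at each concatenation) can be traced with a short sketch. This is only an illustration under stated assumptions: `unet_shapes`, the 64 starting filters and the "same"-padded convolutions are choices made here and are not taken from the paper.

```python
def unet_shapes(size: int, base_filters: int = 64, depth: int = 4):
    """Trace (stage, spatial size, channels) through a U-Net-style network.

    Assumes 'same'-padded convolutions, so only max-pooling and transposed
    convolutions change the spatial size (an assumption for illustration).
    """
    trace, skips = [], []
    filters = base_filters
    for level in range(depth):
        trace.append((f"enc{level + 1}", size, filters))
        skips.append((size, filters))   # features saved before max-pooling
        size //= 2                      # max-pooling with stride 2
        filters *= 2                    # filters double after down-sampling
    trace.append(("bridge", size, filters))
    for level in reversed(range(depth)):
        size *= 2                       # transposed convolution up-samples
        filters //= 2                   # and halves the number of filters
        skip_size, skip_filters = skips[level]
        if skip_size != size:
            raise ValueError("input size must be divisible by 2 ** depth")
        # Channels seen by the block right after the skip concatenation.
        trace.append((f"dec{level + 1}", size, filters + skip_filters))
    return trace

for stage in unet_shapes(512):
    print(stage)
```

The size check shows why an input that is not divisible by 2 to the power of the depth needs padding or cropping before the encoder and decoder features can be concatenated.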
II-B Aggregated Residual Block
As mentioned in the previous section, the U-Net has only four blocks of convolution layers for feature extraction. This conventional structure may not be sufficient for complex medical image analysis such as multi-class image segmentation of the lung, which is the aim of this study. Although U-Net can easily separate the lung in a CT image, it may have limited ability to distinguish the different infection regions of a lung infected by COVID-19. A deeper network with more layers is therefore needed, especially for the encoding process. However, deeper networks expose a problem as they converge: with increasing network depth, accuracy saturates and then degrades rapidly. This is known as the degradation problem [9, 27]. He et al. proposed the ResNet [10] to mitigate the effect of network degradation on model learning. ResNet utilizes a skip connection with residual learning to overcome the degradation and avoid estimating a large number of parameters generated by the convolutional layers. The typical ResNet block is depicted in Fig. 2.
The function can be defined as:

$$y = \mathcal{F}(x, \{W_i\}) + x$$

where $x$ and $y$ are the input and output of the block and $W_i$ is the trainable weight of the weight layer. Unlike the U-Net, which concatenates the feature maps into the decoding process, ResNet adopts a shortcut that adds the identity to the output of each block. The stacked residual blocks are better able to learn the latent representation of the input CT image. However, the model becomes more complex and harder to converge as the number of layers increases.
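The shortcut addition can be illustrated with a minimal sketch on plain Python lists. The names `residual_block` and `toy_transform` are hypothetical, and `toy_transform` is only a stand-in for the stacked weight layers, not a learned function.

```python
from typing import Callable, List

Vector = List[float]

def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return transform(x) + x: the residual branch plus the identity shortcut."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("shortcut addition needs matching shapes")
    return [f + xi for f, xi in zip(fx, x)]

def toy_transform(x: Vector) -> Vector:
    """Hypothetical stand-in for the weight layers of the block."""
    return [0.5 * v for v in x]

print(residual_block([2.0, 4.0], toy_transform))               # [3.0, 6.0]
# If the residual branch outputs zeros, the block is exactly the identity.
print(residual_block([1.0, -2.0], lambda v: [0.0] * len(v)))   # [1.0, -2.0]
```

The second call shows the property that motivates residual learning: a block can fall back to the identity mapping simply by driving its residual branch to zero.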
To address this, Xie et al. proposed the Aggregated Residual Network (ResNeXt) and showed that increasing the cardinality is more effective than increasing the depth or width [31]. The cardinality is defined as the size of the set of aggregated residual transformations, formulated as follows:

$$\mathcal{F}(x) = \sum_{i=1}^{C} \mathcal{T}_i(x)$$

where $C$ is the number of residual transformations to be aggregated and $\mathcal{T}_i(x)$ can be any function. Considering a simple neuron, $\mathcal{T}_i$ should ideally be a transformation that projects $x$ into a low-dimensional embedding and then transforms it. Accordingly, we can extend it into the residual function:

$$y = x + \sum_{i=1}^{C} \mathcal{T}_i(x)$$

where $y$ is the output. The ResNeXt block is visualized in Fig. 3. Compared with Fig. 2, the ResNeXt block has a slightly different structure. Its weight layers are smaller than those of ResNet, as ResNeXt uses the cardinality to reduce the number of layers while keeping the performance. It is worth mentioning that the three small blocks inside the ResNeXt block need to have the same topology; in other words, they should be topologically equivalent.
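A toy version of the aggregated transformation may make the role of the cardinality concrete. In this sketch every branch has the same topology (project to a scalar embedding, apply ReLU, expand back), which is an assumption made for illustration; `make_branch`, `resnext_block` and the weights are hypothetical.

```python
def make_branch(w_down, w_up):
    """One transformation: project x to a scalar embedding, apply ReLU, expand back."""
    def branch(x):
        embedding = max(0.0, sum(w * v for w, v in zip(w_down, x)))
        return [embedding * u for u in w_up]
    return branch

def resnext_block(x, branches):
    """Identity shortcut plus the sum of all branch outputs.

    len(branches) plays the role of the cardinality.
    """
    out = list(x)
    for branch in branches:
        for k, value in enumerate(branch(x)):
            out[k] += value
    return out

branches = [
    make_branch([1.0, 0.0], [1.0, 1.0]),
    make_branch([0.0, 1.0], [1.0, -1.0]),
    make_branch([-1.0, 0.0], [1.0, 1.0]),
]
print(resnext_block([1.0, 2.0], branches))  # [4.0, 1.0]
```

Because all branches share one topology, raising the cardinality only means adding more entries to `branches`; the block structure itself does not change.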
Similar to ResNet, after a sequence of blocks the learned features are fed into a global average pooling layer to generate the final feature map. Unlike convolutional layers and normal pooling layers, the global average pooling layer takes the average of the feature maps derived by all blocks. It summarizes the spatial information captured at each step and is generally more robust than directly applying a spatial transformation to the input. Mathematically, we can treat the global average pooling layer as a structural regularizer that helps to derive the desired feature maps [35].
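For reference, global average pooling in its standard form reduces each feature map to a single mean value. The sketch below uses nested Python lists and a function name chosen here for illustration; it shows the generic operation, not the exact layer configuration of the proposed model.

```python
def global_average_pool(feature_maps):
    """Collapse each H x W feature map to its mean, giving one value per channel."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

maps = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0
    [[0.0, 0.0], [0.0, 8.0]],   # channel 1
]
print(global_average_pool(maps))  # [2.5, 2.0]
```

The layer has no trainable parameters, which is one reason it is often described as acting like a regularizer.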
Importantly, instead of using the original U-Net encoder, our proposed deep learning model adopts the ResNeXt block (see Fig. 3) to conduct the feature extraction. ResNeXt avoids the need for a very deep network while maintaining the performance. In addition, the training cost of ResNeXt is lower than that of ResNet.
II-C Locality Sensitive Hashing Attention
The decoder in U-Net up-samples the extracted feature map to generate the segmentation image. However, owing to the limited capacity of the convolutional neural network, it may not be able to capture complex features if the network structure is not deep enough. In recent years, transformers have gained increasing interest. The key to their success is the attention mechanism. Attention includes two different mechanisms: soft attention and hard attention. We adopt soft attention to improve the model learning. Unlike hard attention, which can focus only on the absolute position, soft attention lets the model focus on each pixel's relative position. There are two different types of soft attention: scaled dot-product attention and multi-head attention, as shown in Fig. 4. The scaled dot-product attention takes as inputs a query $Q$, a key $K$ of dimension $d_k$ and a value $V$ of dimension $d_v$. The dot-product attention is defined as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $K^{T}$ represents the transpose of the matrix $K$ and $\frac{1}{\sqrt{d_k}}$ is a scaling factor. The softmax function with input $z$ is given by:

$$\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$
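The scaled dot-product attention of Vaswani et al. can be sketched in a few lines of plain Python. Matrices are lists of rows, the function names are chosen here for illustration, and the softmax subtracts the maximum score for numerical stability, which does not change its result.

```python
import math

def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    m = max(scores)                      # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """For each query row: softmax(q . k / sqrt(d_k)) weighted sum of value rows."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output

Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))  # [[2.0, 3.0]]
```

In the example the query is equally similar to both keys, so the output is the plain average of the two value rows.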
Vaswani et al. [29] showed that performing different linear projections of the queries $Q$, keys $K$ and values $V$ in $h$ parallel layers benefits the attention score calculation. We can assume that $Q$, $K$ and $V$ have been linearly projected to $d_k$, $d_k$ and $d_v$ dimensions, respectively. It is worth noting that these linear projections are different and learnable. On each projection $i$, we have a query, key and value triple to conduct the attention calculation in parallel, which results in a $d_v$-dimensional output. The calculation can be formulated as:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

where the projections $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are parameter matrices and $W^{O}$ is the weight matrix used to balance the results of the $h$ layers.
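The multi-head computation can be sketched as follows, again with plain Python lists. The projection matrices in the example are hand-picked toy values, not learned weights, and the helper names (`matmul`, `attention`, `multi_head_attention`) are chosen here for illustration.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V):
    """Scaled dot-product attention, one output row per query row."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, V))
                    for c in range(len(V[0]))])
    return out

def multi_head_attention(Q, K, V, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [attention(matmul(Q, W_q), matmul(K, W_k), matmul(V, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs along the feature axis, then mix with W_o.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(Q))]
    return matmul(concat, W_o)

X = [[1.0, 0.0], [0.0, 1.0]]                      # toy self-attention input
first = [[1.0], [0.0]]                            # picks the first feature
second = [[0.0], [1.0]]                           # picks the second feature
heads = [(first, first, first), (second, second, second)]
identity = [[1.0, 0.0], [0.0, 1.0]]
print(multi_head_attention(X, X, X, heads, identity))
```

Each head here works in a one-dimensional subspace, so the two heads attend to different features of the same input before their outputs are concatenated.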
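However, multi-head attention is memory inefficient due to the size of $Q$, $K$ and $V$. Assume that $Q$, $K$ and $V$ have the shape $[\text{batch\_size}, \text{length}, d_{model}]$, where each entry represents the size of the corresponding dimension. The term $QK^{T}$ will produce a tensor of shape $[\text{batch\_size}, \text{length}, \text{length}]$. Given the standard image size, the $\text{length} \times \text{length}$ term will take most of the memory. Kitaev et al. [14] proposed Locality Sensitive Hashing (LSH) based attention to address this issue. First, we rewrite the basic attention formula for each query position $i$ in the partition form:

$$o_i = \sum_{j \in P_i} \exp\left(q_i \cdot k_j - z(i, P_i)\right) v_j$$

where the function $z$ is the partition function and $P_i$ is the set which the query at position $i$ attends to. During model training, we normally conduct batching over a larger set $\tilde{P}_i = \{0, 1, \ldots, l\} \supseteq P_i$ while masking out the elements not in $P_i$:

$$o_i = \sum_{j \in \tilde{P}_i} \exp\left(q_i \cdot k_j - m(j, P_i) - z(i, P_i)\right) v_j, \qquad m(j, P_i) = \begin{cases} \infty & \text{if } j \notin P_i \\ 0 & \text{otherwise} \end{cases}$$

Then, with a hash function $h$, we can get $P_i$ as:

$$P_i = \{ j : h(q_i) = h(k_j) \}$$

To guarantee that the number of keys matches the number of queries in each bucket, we need to ensure that $h(k_j) = h(q_j)$, where $k_j = \frac{q_j}{\lVert q_j \rVert}$. During the hashing process, some similar items may fall into different buckets. Multi-round hashing provides an effective way to overcome this issue. Suppose there are $n_r$ rounds and each round has a different hash function $h^{(r)}$, so we have:

$$P_i = \bigcup_{r=1}^{n_r} P_i^{(r)}, \qquad P_i^{(r)} = \{ j : h^{(r)}(q_i) = h^{(r)}(q_j) \}$$

Considering the batching case, we need to get $\tilde{P}_i^{(r)}$ for each round $r$:

$$\tilde{P}_i^{(r)} = \left\{ j : \left\lfloor \frac{s_i^{(r)}}{m} \right\rfloor - 1 \leq \left\lfloor \frac{s_j^{(r)}}{m} \right\rfloor \leq \left\lfloor \frac{s_i^{(r)}}{m} \right\rfloor \right\}$$

where $m = \frac{2l}{n_{buckets}}$ and $s_i^{(r)}$ denotes the position of query $i$ after sorting by bucket in round $r$. The last step is to calculate the LSH attention score in parallel. With the formula (1) and (3), we can derive:

$$o_i = \sum_{r=1}^{n_r} \exp\left(z(i, P_i^{(r)}) - z(i, P_i)\right) \sum_{j \in \tilde{P}_i^{(r)}} \frac{1}{N_{i,j}} \exp\left(q_i \cdot k_j - m(j, P_i^{(r)}) - z(i, P_i^{(r)})\right) v_j$$

where $N_{i,j}$ is the number of rounds in which $j \in P_i^{(r)}$.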
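The bucketing idea behind LSH attention can be sketched with the angular hashing scheme described in the Reformer paper, where a vector is hashed to the index of the largest entry of its random projections and their negations. This is a simplified single-round illustration, not the authors' implementation: the function names, the fixed seed and the toy queries are assumptions, and sorting, chunking and multi-round hashing are left out.

```python
import math
import random
from collections import defaultdict

def lsh_hash(x, R):
    """Angular LSH: index of the largest entry among the projections and their negations."""
    proj = [sum(xi * ri for xi, ri in zip(x, direction)) for direction in R]
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=lambda k: scores[k])

def lsh_buckets(queries, n_buckets=4, seed=0):
    """Group query indices by bucket; keys share the hash of their own query."""
    if n_buckets % 2:
        raise ValueError("n_buckets must be even")
    rng = random.Random(seed)
    dim = len(queries[0])
    # R holds n_buckets / 2 random Gaussian projection directions.
    R = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_buckets // 2)]
    buckets = defaultdict(list)
    for i, q in enumerate(queries):
        norm = math.sqrt(sum(v * v for v in q)) or 1.0
        buckets[lsh_hash([v / norm for v in q], R)].append(i)
    return dict(buckets)

def attended_pairs(buckets):
    """Number of query-key pairs scored when attention stays inside buckets."""
    return sum(len(members) ** 2 for members in buckets.values())

queries = [[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
buckets = lsh_buckets(queries)
print(buckets)
print(attended_pairs(buckets), "pairs instead of", len(queries) ** 2)
```

Vectors pointing in the same direction land in the same bucket and opposite vectors are separated, so attention restricted to buckets scores far fewer pairs than the full length × length matrix.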
II-D Training Strategy
The task of lung CT image segmentation is to predict whether each pixel of the given image belongs to a predefined class or to the background. Traditional medical image segmentation therefore reduces to a binary pixel-wise classification problem. In this study, however, we focus on multi-class image segmentation, which can be cast as a multi-class pixel-wise classification. Hence, we choose the multi-class cross entropy as the loss function:

$$L = -\sum_{c=1}^{M} y_{o,c} \log\left(p_{o,c}\right)$$

where $y_{o,c}$ is a binary value indicating whether class $c$ is the correct class for observation $o$, $p_{o,c}$ is the predicted probability that observation $o$ belongs to class $c$, and $M$ is the number of classes.
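A minimal sketch of the pixel-wise multi-class cross entropy is given below. Labels are passed as class indices, which is equivalent to a one-hot indicator, and the loss is averaged over pixels; the averaging, the function name and the `eps` clamp are assumptions made here for illustration.

```python
import math

def multiclass_cross_entropy(probs, labels, eps=1e-12):
    """Average over pixels of the negative log-probability of the correct class.

    probs:  one list of class probabilities per pixel
    labels: the index of the correct class for each pixel
    """
    total = 0.0
    for p, label in zip(probs, labels):
        total -= math.log(max(p[label], eps))   # eps guards against log(0)
    return total / len(labels)

# Two pixels, four classes (background plus three lesion types).
probs = [[0.5, 0.25, 0.125, 0.125],
         [0.1, 0.8, 0.05, 0.05]]
labels = [0, 1]
print(multiclass_cross_entropy(probs, labels))
```

Only the probability assigned to the correct class contributes for each pixel, so a confident correct prediction drives the loss toward zero.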
III Experiment and Evaluation Results
III-A Data Description
We used COVID-19 CT images collected by the Italian Society of Medical and Interventional Radiology (SIRM) (https://www.sirm.org/category/senza-categoria/covid-19/) for our experimental study. The dataset included 110 axial CT images collected from 60 patients. These images were reversely intensity-normalized by taking RGB values from the JPG images in areas of air (either outside the patient or in the trachea) and fat (subcutaneous fat from the chest wall or pericardial fat), which were used to establish a unified Hounsfield unit scale (air was normalized to -1000 and fat to -100). The ground-truth segmentation was done by a trained radiologist using MedSeg (http://medicalsegmentation.com/) with three labels: 1 = ground-glass opacification, 2 = consolidations, and 3 = pleural effusions. A total of 100 samples that have both preprocessed CT images and masks were used for our experimental analysis. These data are publicly available (http://medicalsegmentation.com/covid19/).
III-B Data Preprocessing and Augmentation
The original CT images have a size of 512 × 512. We use opencv2 (https://opencv.org/opencv-2-4-8/) to convert the images to a size of 369 × 369 and to grayscale. This processing helps to automatically minimize the effects of the black frame in the images and of random noise (e.g., words) on the segmentation.
Because our model is based on deep learning, the number of samples affects the performance significantly. Given the small size of the dataset, data augmentation is necessary for training the neural network to achieve high generalizability. We implement parameterized transformations to realize data augmentation. We rotate the existing images by 90, 180 and 270 degrees to generate another 300 examples; the corresponding masks are generated by rotating them by the same angles. Scaling has the same property as rotation, so we scale each image by factors of 0.5 and 1.5 to generate another 200 images and their corresponding masks.
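The rotation step can be sketched as follows, with images and masks represented as nested lists. This is an illustrative stand-in rather than the pipeline actually used in the study; the key point it shows is that the image and its mask must receive exactly the same transformation.

```python
def rotate90(grid):
    """Rotate a 2-D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def augment_by_rotation(image, mask):
    """Return the 90, 180 and 270 degree rotations of an image/mask pair."""
    pairs = []
    for _ in range(3):
        image, mask = rotate90(image), rotate90(mask)   # same transform for both
        pairs.append((image, mask))
    return pairs

image = [[1, 2], [3, 4]]
mask = [[0, 1], [0, 0]]
for rotated_image, rotated_mask in augment_by_rotation(image, mask):
    print(rotated_image, rotated_mask)
```

With the 100 original samples, three rotations give the 300 extra examples mentioned above, and the two scaling factors give another 200.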
III-C Experimental Settings and Evaluation Metrics
For model training, we use Adam [13] as the optimizer. For a fair comparison, we train our model and the U-Net with the default parameters for 100 epochs. Both models are trained with and without data augmentation. We conducted the experimental analyses on our own server, which consists of two 12-core/24-thread Intel(R) Xeon(R) E5-2697 v2 CPUs, 6 NVIDIA TITAN X Pascal GPUs, 2 NVIDIA TITAN RTX GPUs, and a total of 768 GiB of memory.
In a segmentation task, especially multi-class image segmentation, the target area of interest may occupy only a small part of the whole image. Thus, we adopt the Dice score, accuracy, and precision as the evaluation metrics. The Dice score is defined as:

$$\mathrm{Dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$$

where $A$ and $B$ are two sets and $|\cdot|$ gives the number of elements in a set. Here $A$ is the ground-truth result and $B$ is the predicted result. We conduct the experimental comparison based on 10-fold cross-validation for performance assessment.
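The Dice score can be computed directly on sets of pixel coordinates, as in this short sketch. Returning 1.0 when both sets are empty is a convention chosen here, not something specified in the text.

```python
def dice_score(truth, pred):
    """Dice score of two sets: twice the overlap divided by the sum of the sizes."""
    truth, pred = set(truth), set(pred)
    if not truth and not pred:
        return 1.0   # convention chosen here: two empty masks agree perfectly
    return 2.0 * len(truth & pred) / (len(truth) + len(pred))

truth = {(0, 0), (0, 1), (1, 1)}   # ground-truth pixels of one class
pred = {(0, 1), (1, 1), (2, 2)}    # predicted pixels of the same class
print(dice_score(truth, pred))     # 2 * 2 / (3 + 3)
```

For multi-class segmentation the score would typically be computed per class in this way and then averaged, which is again an assumption about the evaluation rather than a detail given in the text.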
Fig. 5 provides two examples of the segmentation results obtained with data augmentation. Table I reports the evaluation metrics for our proposed model and the U-Net, with and without data augmentation.
|Model||With Augmentation||No Augmentation|
Based on this table, our proposed method outperforms U-Net, with an improvement of at least 10% in all three metrics. As shown in figure 2(h), the original U-Net almost failed to perform the segmentation. The most likely reason is that the region of interest is very small and the U-Net does not have enough capacity to distinguish such subtle differences.
III-E Ablation Study
In addition to the above results, we are also interested in the effectiveness of each component of the proposed model. Accordingly, we conduct an ablation study on the ResNeXt block and the attention separately to investigate how these components affect the segmentation performance. To ensure a fair comparison, we conduct the ablation study in exactly the same experimental environment as the main experiments presented in Section III-C. We evaluate two variants of our model: the model without attention (M-A) and the model without the ResNeXt block (M-R). The model without ResNeXt is similar to the one in the literature [20]. The results are summarized in Table II. We can observe that both the attention and the ResNeXt block play important roles in our model and contribute to improved segmentation performance in comparison with U-Net.
|Model|With Augmentation| | |No Augmentation| | |
|M-A|0.85|0.82|0.84|0.79|0.74|0.77|
|M-R|0.84|0.81|0.83|0.77|0.76|0.77|
IV Discussion and Conclusions
Up to now, CT imaging has been the most common screening tool for COVID-19. It can help the community accelerate diagnosis and accurately evaluate the severity of COVID-19. In this paper, we presented a novel deep learning-based algorithm for automated segmentation of COVID-19 CT images, which proved feasible and superior to a series of baselines. We proposed a modified U-Net model that exploits the residual network to enhance the feature extraction. An efficient attention mechanism was further embedded into the decoding process to generate high-quality multi-class segmentation results. Our method gained more than 10% improvement in multi-class segmentation when compared against U-Net and a set of baselines.
Recent studies show that early detection of COVID-19 is very important. If the infection in chest CT images can be detected at an early stage, patients have a higher chance of survival. Our study provides an effective tool for radiologists to precisely determine the lung's infection percentage and track the progression of COVID-19. It also sheds some light on how deep learning can revolutionize diagnosis and treatment in the midst of COVID-19.
Our future work will generalize the proposed model to a wider range of practical scenarios, such as assisting with the diagnosis of more types of diseases from CT images. In particular, in the case of a new disease such as the coronavirus, the amount of ground-truth data is usually limited given the difficulty of data acquisition and annotation. In such cases, the model should be able to generalize and adapt itself using only a few available ground-truth samples. A knowledge-based generative model [34] will be integrated to enhance the ability to handle new tasks. Another line of future work lies in interpretability, which is especially critical for applications in the medical domain. Although deep learning is widely acknowledged to be limited in interpretability, the attention mechanism adopted in this work can provide some level of interpretation of the internal decision process. To gain deeper scientific insights, we will keep working in this direction and explore hybrid attention models for generating semantically meaningful explanations.
[1] (2020) Correlation of chest CT and RT-PCR testing in coronavirus disease 2019 (COVID-19) in China: a report of 1014 cases. Radiology, pp. 200642.
[2] (2018) Recurrent residual convolutional neural network based on U-Net (R2U-Net) for medical image segmentation. arXiv preprint arXiv:1802.06955.
[3] (2017) An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation. In International Workshop on Statistical Atlases and Computational Models of the Heart, pp. 111–119.
[4] (2020) Chest CT findings in coronavirus disease-19 (COVID-19): relationship to duration of infection. Radiology, pp. 200463.
[5] Interpretable parallel recurrent neural networks with convolutional attentions for multi-modality activity modeling. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
[6] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
[7] (2019) Automatic multiorgan segmentation in thorax CT images using U-Net-GAN. Medical Physics 46 (5), pp. 2157–2168.
[8] (2020) Attention U-Net based adversarial architectures for chest X-ray lung segmentation. arXiv preprint arXiv:2003.10304.
[9] (2015) Convolutional neural networks at constrained time cost. pp. 5353–5360.
[10] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[11] (2020) Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. The Lancet 395 (10223), pp. 497–506.
[12] (2020) MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Networks 121, pp. 74–87.
[13] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[14] (2020) Reformer: the efficient transformer. In International Conference on Learning Representations.
[15] (2020) Coronavirus disease 2019 (COVID-19): role of chest CT in diagnosis and management. American Journal of Roentgenology, pp. 1–7.
[16] (2019) Liver CT sequence segmentation based with improved U-Net and graph cut. Expert Systems with Applications 126, pp. 54–63.
[17] (2020) Diagnosis of the coronavirus disease (COVID-19): rRT-PCR or CT?. European Journal of Radiology, pp. 108961.
[18] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440.
[19] (2020) Automatic detection of coronavirus disease (COVID-19) using X-ray images and deep convolutional neural networks. arXiv preprint arXiv:2003.10849.
[20] (2018) Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.
[21] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[22] (2020) Coronavirus disease 2019 (COVID-19): a systematic review of imaging findings in 919 patients. American Journal of Roentgenology, pp. 1–7.
[23] (2020) Lung infection quantification of COVID-19 in CT images with deep learning. arXiv preprint arXiv:2003.04655.
[24] (2017) Deep learning in medical image analysis. Annual Review of Biomedical Engineering 19, pp. 221–248.
[25] (2020) Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. arXiv preprint arXiv:2004.02731.
[26] (2020) Emerging 2019 novel coronavirus (2019-nCoV) pneumonia. Radiology, pp. 200274.
[27] (2015) Highway networks. arXiv preprint arXiv:1505.00387.
[28] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
[29] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[30] (2020) A review of the 2019 novel coronavirus (COVID-19) based on current evidence. International Journal of Antimicrobial Agents, pp. 105948.
[31] (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500.
[32] (2019) Multi-depth fusion network for whole-heart CT image segmentation. IEEE Access 7, pp. 23421–23429.
[33] (2010) Deconvolutional networks. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2528–2535.
[34] Adversarial variational embedding for robust semi-supervised learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 139–147.
[35] Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.
[36] (2020) A novel coronavirus from patients with pneumonia in China, 2019. New England Journal of Medicine.