
An automatic COVID-19 CT segmentation based on U-Net with attention mechanism

by   Tongxue Zhou, et al.

The coronavirus disease (COVID-19) pandemic has had a devastating effect on global public health. Computed Tomography (CT) is an effective tool for COVID-19 screening. It is of great importance to rapidly and accurately segment COVID-19 lesions from CT to support diagnosis and patient monitoring. In this paper, we propose a U-Net based segmentation network using an attention mechanism. As not all the features extracted from the encoders are useful for segmentation, we incorporate an attention mechanism into a U-Net architecture to capture rich contextual relationships for better feature representations. In addition, the focal Tversky loss is introduced to deal with small lesion segmentation. The experimental results, evaluated on a small dataset where only 100 CT slices are available, demonstrate that the proposed method can achieve accurate and rapid COVID-19 segmentation. The obtained Dice Score, Sensitivity and Specificity are 69.1%, 81.1% and 97.2%, respectively.



1 Introduction

In December 2019, a novel coronavirus, now designated as COVID-19 by the World Health Organization (WHO), was identified as the cause of an outbreak of acute respiratory illness [10]. The COVID-19 pandemic is spreading all over the world and is having a devastating effect on global public health. The number of people infected by the virus is increasing rapidly. As of April 11, 2020, 1,610,909 cases of COVID-19 had been reported in over 200 countries and territories, resulting in approximately 99,690 deaths, and there is no effective treatment at present.

A critical step in the fight against COVID-19 is effective screening and monitoring of infected patients. In clinical practice, chest Computed Tomography (CT), as a non-invasive imaging approach, can detect certain characteristic manifestations in the lung associated with COVID-19. It is considered a low-cost, accurate and efficient diagnostic tool for early screening and diagnosis of COVID-19. It allows clinicians to evaluate how severely the lungs are affected and how the patient's disease is evolving, which is helpful in making treatment decisions [5][6]. Motivated by this, a number of artificial intelligence (AI) systems based on deep learning have been proposed, and the results have been shown to be quite promising. Compared to the traditional imaging workflow, which relies heavily on human labor, AI enables safer, more accurate and more efficient imaging solutions. Recent AI-empowered applications in COVID-19 mainly include the dedicated imaging platform, the lung and infection region segmentation, the clinical assessment and diagnosis, as well as the pioneering basic and clinical research [8].


Segmentation is an essential step in AI-based COVID-19 image processing and analysis. It delineates the regions of interest (ROIs), e.g., lung, lobes, bronchopulmonary segments, and infected regions or lesions, in chest X-ray or CT images for further assessment and quantification [8]. Several studies have addressed COVID-19 detection and segmentation. For example, Zheng et al. [12] proposed a weakly-supervised deep learning-based software system using 3D CT volumes to detect COVID-19. Gozes et al. [2] presented a system that utilises 2D slice analysis and 3D volume analysis to detect COVID-19. Jin et al. [4] proposed an AI system for fast COVID-19 diagnosis, where a segmentation model is first used to obtain the lung lesion regions, and a classification model is then used to determine whether each lesion region is COVID-19-like.

In this paper, we propose a deep learning based segmentation network with an attention mechanism. A preliminary conference version appeared at ISBI 2020 [13], which focused on the multi-modal fusion issue. This journal version is a substantial extension, including (1) an automatic COVID-19 CT segmentation network; (2) a focal Tversky loss function introduced to help segment the small COVID-19 regions; (3) an attention mechanism proposed to capture rich contextual relationships for better feature representations.

The paper is organized as follows: Section 2 offers an overview of this work and details our model, Section 3 describes the experimental setup, Section 4 presents the experimental results, and Section 5 discusses the proposed method and concludes this work.

2 Method

2.1 Network Architecture

Our network is mainly based on the U-Net architecture [7], in which we integrate the attention mechanism, the res_dil block and deep supervision. The encoder of the U-Net is used to obtain the feature representations. The feature representations at each layer are fed into an attention mechanism, where they are re-weighted channel-wise and space-wise so that the most informative representations are retained. Finally, they are projected by the decoder to the label space to obtain the segmentation result. In the following, we describe the main components of our model: the encoder and decoder, the res_dil block, deep supervision and the attention mechanism. The network architecture is shown in Fig. 1.

Figure 1: The proposed network architecture with attention mechanism.

2.2 Encoder and Decoder

The encoder is used to obtain the feature representations. It includes a convolutional block and a res_dil block followed by a skip connection. In order to maintain the spatial information, we use a convolution with stride = 2 in place of the pooling operation. All convolutions use the same kernel size, and the number of filters is increased from 32 to 512. Each decoder level begins with an upsampling layer followed by a convolution that reduces the number of features by a factor of 2. The upsampled features are then combined with the features from the corresponding level of the encoder using concatenation. After the concatenation, we use the res_dil block to increase the receptive field. In addition, we employ deep supervision [3] for the segmentation decoder by integrating segmentation layers from different levels to form the final network output, as shown in Fig. 2.
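
As an illustration of the deep supervision idea described above, the following minimal NumPy sketch fuses segmentation maps from several decoder levels by upsampling the coarser maps and adding them to the finer ones. The function names (`upsample2x`, `deep_supervision`), the nearest-neighbour upsampling, the elementwise summation and the map sizes are assumptions made for illustration, in the spirit of [3]; they are not necessarily the exact layers used in our network.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling of an (H, W, C) map by a factor of 2."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def deep_supervision(seg_maps):
    """Fuse per-level segmentation maps, ordered from coarsest to finest.

    Each map is assumed to have twice the spatial size of the previous one.
    """
    fused = seg_maps[0]
    for finer in seg_maps[1:]:
        # Bring the running sum to the finer grid, then add the finer map.
        fused = upsample2x(fused) + finer
    return fused

# Hypothetical segmentation layers from three decoder levels (1 channel each).
rng = np.random.default_rng(0)
levels = [rng.normal(size=(s, s, 1)) for s in (128, 256, 512)]
logits = deep_supervision(levels)
probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid for a binary lesion map
print(logits.shape)
```

Summing the upsampled maps lets every decoder level contribute directly to the final output, so that gradients reach the coarser levels without passing through the whole decoder.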

2.3 Res_dil Block and Deep Supervision

Segmenting different regions in an image is likely to require different receptive fields. Since the standard U-Net cannot capture enough semantic features due to its limited receptive field, and inspired by dilated convolution [11], we propose to use a residual block with dilated convolutions in both the encoder and the decoder to obtain features at multiple scales, as shown in Fig. 2. The res_dil block can capture more extensive local information, which helps retain information and fill in details during training.

Figure 2: The architecture of our proposed Res_dil block (left) and deep supervision (right). IN refers to instance normalization and Dil_conv to the dilated convolution (rate = 2 and 4, respectively). We refer to the vertical depth as level, with higher levels having higher spatial resolution. In the deep supervision part, the labels denote, for each decoder level, the output of the res_dil block and the segmentation result of that level.
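
To make the effect of the dilation rates concrete, the short sketch below computes the receptive field of two stacked stride-1 convolutions with rates 2 and 4, compared with two undilated convolutions. The kernel size of 3 is an assumption for illustration, and the helper names are hypothetical; the formula used is the standard effective kernel size k + (k - 1)(r - 1) of a dilated convolution.

```python
def effective_kernel(kernel_size, rate):
    """Effective extent of a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

def receptive_field(kernel_size, rates):
    """Receptive field of stacked stride-1 convolutions with the given rates."""
    rf = 1
    for rate in rates:
        rf += effective_kernel(kernel_size, rate) - 1
    return rf

K = 3  # assumed kernel size, for illustration only
plain = receptive_field(K, [1, 1])    # two ordinary convolutions
dilated = receptive_field(K, [2, 4])  # rates as in the Res_dil block
print(plain, dilated)
```

Under these assumptions the dilated pair covers a 13-pixel extent instead of 5, without adding parameters.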

2.4 Attention Mechanism

In a U-Net shaped network, not all the features obtained by the encoder are effective for segmentation. In addition, not only do different channels (filters) contribute differently, but different spatial locations within each channel also carry different weight in the feature representation for segmentation. To this end, we introduce an attention mechanism into both the encoder and the decoder to take into account the most informative feature representations for segmentation; the architecture is shown in Fig. 3.

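The individual feature representations from each channel are first concatenated as the input representation $Z = [z_1, z_2, \dots, z_n]$, $z_k \in \mathbb{R}^{H \times W}$, where $n$ is the number of channels in each layer. To simplify the description, a single layer with $n$ channels is considered.

In the channel attention module, a global average pooling is first performed to produce a tensor $g \in \mathbb{R}^{1 \times 1 \times n}$, which represents the global spatial information of the representation, with its $k$-th element

$$g_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} z_k(i, j) \tag{1}$$

Then two fully-connected layers are applied to encode the channel-wise dependencies, $\hat{g} = W_1\, \delta(W_2\, g)$, with $W_1$, $W_2$ being the weights of the two fully-connected layers and $\delta(\cdot)$ the ReLU operator. $\hat{g}$ is then passed through the sigmoid layer $\sigma(\cdot)$ to obtain the channel-wise weights, which are applied to the input representation $Z$ through multiplication to achieve the channel-wise representation $Z_c$; $\sigma(\hat{g}_k)$ indicates the importance of the $k$-th channel of the representation:

$$Z_c = [\sigma(\hat{g}_1)\, z_1, \sigma(\hat{g}_2)\, z_2, \dots, \sigma(\hat{g}_n)\, z_n] \tag{2}$$

In the spatial attention module, the representation can be considered as $Z = [z^{1,1}, z^{1,2}, \dots, z^{i,j}, \dots, z^{H,W}]$, $z^{i,j} \in \mathbb{R}^{1 \times 1 \times n}$, $i \in \{1, \dots, H\}$, $j \in \{1, \dots, W\}$, and then a convolution operation $q = W_s \star Z$, with weight $W_s \in \mathbb{R}^{1 \times 1 \times n \times 1}$, is used to squeeze the spatial domain and to produce a projection tensor, which represents the linearly combined representation of all channels at a spatial location. The tensor is finally passed through a sigmoid layer to obtain the space-wise weights and to achieve the spatial-wise representation $Z_s$; $\sigma(q_{i,j})$ indicates the importance of the spatial location $(i, j)$ of the representation:

$$Z_s = [\sigma(q_{1,1})\, z^{1,1}, \dots, \sigma(q_{i,j})\, z^{i,j}, \dots, \sigma(q_{H,W})\, z^{H,W}] \tag{3}$$

The fused feature representation is obtained by adding the channel-wise representation and the space-wise representation:

$$Z_f = Z_c + Z_s \tag{4}$$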


The attention mechanism can be directly adapted to any feature representation problem, and it encourages the network to capture rich contextual relationships for better feature representations.
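
The channel and spatial attention modules described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than our Keras implementation: the function names, the reduction of the hidden fully-connected layer to `n // 2` units, the random weights and the small input size are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(Z, W1, W2):
    """Z: (H, W, n); W2: (hidden, n); W1: (n, hidden)."""
    g = Z.mean(axis=(0, 1))                   # global average pooling -> (n,)
    g_hat = W1 @ np.maximum(W2 @ g, 0.0)      # FC -> ReLU -> FC
    return Z * sigmoid(g_hat)                 # one weight per channel

def spatial_attention(Z, Ws):
    """Ws: (n,), the weights of a 1x1 convolution with one output channel."""
    q = Z @ Ws                                # (H, W) projection over channels
    return Z * sigmoid(q)[..., None]          # one weight per spatial location

def attention(Z, W1, W2, Ws):
    """Fused representation: channel-wise plus space-wise re-weighting."""
    return channel_attention(Z, W1, W2) + spatial_attention(Z, Ws)

# Hypothetical sizes and random weights, for illustration only.
H, W, n = 8, 8, 4
rng = np.random.default_rng(0)
Z = rng.normal(size=(H, W, n))
W2 = rng.normal(size=(n // 2, n))
W1 = rng.normal(size=(n, n // 2))
Ws = rng.normal(size=(n,))
Z_f = attention(Z, W1, W2, Ws)
print(Z_f.shape)
```

With all weights set to zero, both sigmoids output 0.5 and the fused representation equals the input, which is a convenient sanity check for an implementation.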

Figure 3: The architecture of the attention mechanism. The individual feature representations ($z_1$, $z_2$, …, $z_n$) are first concatenated as $Z$, and then they are recalibrated by the channel attention module and the spatial attention module to obtain $Z_c$ and $Z_s$; finally, these are added to obtain the fused feature representation $Z_f$.

3 Experimental Setup

3.1 Dataset

The dataset used in the experiments comes from the Italian Society of Medical and Interventional Radiology: the COVID-19 CT segmentation dataset. The dataset includes 100 axial CT images from 60 patients with COVID-19. The image size is 512 × 512 pixels. The images have been resized, greyscaled and compiled into a single NIfTI file. The images have been segmented by a radiologist using three labels: ground-glass, consolidation and pleural effusion. Since there is severe data imbalance in the dataset (for example, only 25 patients have pleural effusion), we merge the three labels into a single COVID-19 lesion region.
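
Merging the three annotation labels into a single lesion mask amounts to a simple thresholding of the label map, as sketched below. The integer encoding (0 for background, 1 to 3 for the three lesion types) and the toy 4 × 4 map are assumptions for illustration; the actual encoding of the dataset file should be checked before use.

```python
import numpy as np

# Hypothetical label encoding (an assumption): 0 = background,
# 1 = ground-glass, 2 = consolidation, 3 = pleural effusion.
labels = np.array([[0, 1, 1, 0],
                   [0, 2, 0, 0],
                   [3, 3, 0, 0],
                   [0, 0, 0, 0]])

# Any non-background label becomes part of the single lesion class.
lesion = (labels > 0).astype(np.uint8)

# Pixel counts per original class, to make the imbalance visible.
per_class = {c: int((labels == c).sum()) for c in (1, 2, 3)}
print(per_class, int(lesion.sum()))
```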

3.2 Implementation Details

Our network is implemented in Keras and trained on a single Nvidia Quadro P5000 GPU (16 GB). The network is trained with Dice loss and optimized using the Adam optimizer, with an initial learning rate of 5e-5 that is decreased by a factor of 0.5 with a patience of 10 epochs. Early stopping is employed to avoid over-fitting: training stops if the validation loss has not improved for 50 epochs. We randomly split the dataset into 80% training and 20% testing.
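
The interaction between the learning-rate decay and early stopping can be illustrated with a small, framework-free simulation. This is a simplified sketch of a reduce-on-plateau rule combined with early stopping, using the hyper-parameters stated above; the function name and the bookkeeping are assumptions and do not reproduce the exact behaviour of any particular Keras callback.

```python
def simulate_schedule(val_losses, lr=5e-5, factor=0.5,
                      lr_patience=10, stop_patience=50):
    """Return the learning rate used after each epoch until training stops."""
    best = float("inf")
    since_best = 0       # epochs since the best validation loss (early stopping)
    since_lr_event = 0   # stagnant epochs since the last improvement or LR cut
    history = []
    for loss in val_losses:
        if loss < best:
            best, since_best, since_lr_event = loss, 0, 0
        else:
            since_best += 1
            since_lr_event += 1
            if since_lr_event >= lr_patience:
                lr *= factor          # halve the learning rate on a plateau
                since_lr_event = 0
        history.append(lr)
        if since_best >= stop_patience:
            break                     # early stopping
    return history

# Hypothetical validation curve: one good epoch, then no further improvement.
lrs = simulate_schedule([1.0] + [2.0] * 100)
print(len(lrs), lrs[-1])
```

In this toy run the learning rate is halved every 10 stagnant epochs and training stops once 50 epochs have passed without improvement.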

3.3 Loss Function

In the medical community, the Dice Score Coefficient (DSC), defined in (5), is the most widespread metric to measure the overlap ratio of the segmented region and the ground truth, and it is widely used to evaluate segmentation performance. Dice Loss (DL) in (6) is defined such that minimizing it maximizes the overlap between the prediction and the ground truth.


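$$\mathrm{DSC}_c = \frac{2 \sum_{i=1}^{N} p_{ic}\, g_{ic} + \epsilon}{\sum_{i=1}^{N} \left(p_{ic} + g_{ic}\right) + \epsilon} \tag{5}$$

$$\mathrm{DL} = \sum_{c} \left(1 - \mathrm{DSC}_c\right) \tag{6}$$

where $N$ is the number of pixels in the image, $c$ ranges over the set of classes, $p_{ic}$ is the probability that pixel $i$ is of the lesion class $c$ and $p_{i\bar{c}}$ is the probability that pixel $i$ is of the non-lesion class $\bar{c}$. The same is true for the ground truth $g_{ic}$ and $g_{i\bar{c}}$, and $\epsilon$ is a small constant to avoid dividing by 0.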
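
A minimal NumPy sketch of the soft Dice score and Dice loss for a single lesion class is given below. The function names, the value of the smoothing constant and the toy probabilities are assumptions for illustration; a training implementation would operate on framework tensors instead.

```python
import numpy as np

def dice_score(p, g, eps=1e-7):
    """Soft Dice between predicted probabilities p and binary ground truth g."""
    p, g = p.ravel(), g.ravel()
    return (2.0 * np.sum(p * g) + eps) / (np.sum(p) + np.sum(g) + eps)

def dice_loss(p, g, eps=1e-7):
    """Loss that decreases as the overlap increases."""
    return 1.0 - dice_score(p, g, eps)

# Toy example: four pixels, the first two belong to the lesion.
g = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.9, 0.6, 0.2, 0.1])
print(dice_score(p, g), dice_loss(p, g))
```

A perfect prediction gives a score of 1 and a loss of 0, so minimizing the loss maximizes the overlap.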

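One of the limitations of Dice Loss is that it penalizes false positives (FP) and false negatives (FN) equally, which results in segmentation maps with high precision but low recall. This is particularly true for highly imbalanced datasets and small regions of interest (ROIs) such as COVID-19 lesions. Experimental results show that FN needs to be weighted higher than FP to improve the recall rate. The Tversky similarity index [9] is a generalization of the DSC which allows for flexibility in balancing FP and FN:

$$\mathrm{TI}_c = \frac{\sum_{i=1}^{N} p_{ic}\, g_{ic} + \epsilon}{\sum_{i=1}^{N} p_{ic}\, g_{ic} + \alpha \sum_{i=1}^{N} p_{i\bar{c}}\, g_{ic} + \beta \sum_{i=1}^{N} p_{ic}\, g_{i\bar{c}} + \epsilon} \tag{7}$$

where $\alpha$ weights the false negatives and $\beta$ weights the false positives. Setting $\alpha = \beta = 0.5$ recovers the DSC (up to the smoothing constant $\epsilon$).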
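
The following NumPy sketch illustrates how the Tversky index trades off false negatives against false positives. The function name and the values α = 0.7, β = 0.3 are assumptions chosen for illustration, not necessarily the parameters used in our experiments.

```python
import numpy as np

def tversky_index(p, g, alpha=0.7, beta=0.3, eps=1e-7):
    """Soft Tversky index; alpha weights false negatives, beta false positives."""
    p, g = p.ravel(), g.ravel()
    tp = np.sum(p * g)
    fn = np.sum((1.0 - p) * g)
    fp = np.sum(p * (1.0 - g))
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)

# Case A: one lesion pixel is missed (1 TP, 1 FN, 0 FP).
g_a, p_a = np.array([1.0, 1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0, 0.0])
# Case B: one extra pixel is predicted (1 TP, 0 FN, 1 FP).
g_b, p_b = np.array([1.0, 0.0, 0.0, 0.0]), np.array([1.0, 1.0, 0.0, 0.0])

ti_a, ti_b = tversky_index(p_a, g_a), tversky_index(p_b, g_b)
dice_a = tversky_index(p_a, g_a, alpha=0.5, beta=0.5)
dice_b = tversky_index(p_b, g_b, alpha=0.5, beta=0.5)
print(ti_a, ti_b, dice_a, dice_b)
```

With α = β = 0.5 both cases receive the same score, as with the DSC, whereas with α > β the missed lesion pixel (case A) is scored lower than the spurious one (case B).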


Another issue with the DL is that it struggles to segment small ROIs, as they do not contribute to the loss significantly. To address this, Abraham et al. [1] proposed the focal Tversky loss function (FTL):

$$\mathrm{FTL} = \sum_{c} \left(1 - \mathrm{TI}_c\right)^{1/\gamma} \tag{8}$$

where $\gamma$ varies in the range $[1, 3]$. In practice, if a pixel is misclassified with a high Tversky index, the FTL is unaffected. However, if the Tversky index is small and the pixel is misclassified, the FTL will decrease significantly. To this end, we used the FTL to train the network to help segment the small COVID-19 regions.
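
A self-contained NumPy sketch of the focal Tversky loss for a single lesion class follows. The function names and the default values α = 0.7, β = 0.3 and γ = 4/3 are assumptions for illustration only.

```python
import numpy as np

def tversky_index(p, g, alpha, beta, eps=1e-7):
    """Soft Tversky index; alpha weights false negatives, beta false positives."""
    p, g = p.ravel(), g.ravel()
    tp = np.sum(p * g)
    fn = np.sum((1.0 - p) * g)
    fp = np.sum(p * (1.0 - g))
    return (tp + eps) / (tp + alpha * fn + beta * fp + eps)

def focal_tversky_loss(p, g, alpha=0.7, beta=0.3, gamma=4.0 / 3.0):
    """(1 - TI)^(1/gamma); gamma = 1 gives the plain Tversky loss."""
    ti = tversky_index(p, g, alpha, beta)
    return max(0.0, 1.0 - ti) ** (1.0 / gamma)  # clamp guards against rounding

g = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.9, 0.6, 0.2, 0.1])
plain = focal_tversky_loss(p, g, gamma=1.0)
focal = focal_tversky_loss(p, g)
print(plain, focal)
```

For γ > 1 the exponent 1/γ is below 1, so the loss for an imperfect prediction is larger than the plain Tversky loss, while a perfect prediction still yields 0.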

3.4 Evaluation Metrics

Segmentation accuracy determines the eventual success or failure of segmentation procedures. To measure the segmentation performance of the proposed methods, three evaluation metrics, Dice, Sensitivity and Specificity, are used to obtain quantitative measurements of the segmentation accuracy.

1) Dice: It is designed to evaluate the overlap rate of the prediction results and the ground truth. Dice ranges from 0 to 1, and a better prediction has a larger Dice value.

$$\mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{9}$$

2) Sensitivity (also called the true positive rate or recall): It measures the proportion of actual positives that are correctly identified:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{10}$$

3) Specificity (also called the true negative rate): It measures the proportion of actual negatives that are correctly identified:

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{11}$$

where $TP$ represents the number of true positive voxels, $TN$ represents the number of true negative voxels, $FP$ represents the number of false positive voxels, and $FN$ represents the number of false negative voxels.
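
The three metrics can be computed from a pair of binary masks as in the sketch below. The function names and the toy masks are assumptions for illustration, and the code assumes that both masks contain at least one positive and one negative voxel, so that no denominator is zero.

```python
import numpy as np

def confusion_counts(pred, truth):
    """Return (TP, TN, FP, FN) for two binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn

def metrics(pred, truth):
    """Return (Dice, Sensitivity, Specificity)."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return dice, sensitivity, specificity

# Toy masks: 1 true positive, 1 false negative, 1 false positive, 5 true negatives.
truth = np.array([[1, 1, 0, 0], [0, 0, 0, 0]])
pred = np.array([[1, 0, 1, 0], [0, 0, 0, 0]])
dice, sensitivity, specificity = metrics(pred, truth)
print(dice, sensitivity, specificity)
```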

4 Experimental Results

In this section, we conduct extensive comparative experiments including quantitative analysis and qualitative analysis to demonstrate the effectiveness of our proposed method.

4.1 Quantitative Analysis

To assess the performance of our method and to analyze the impact of the proposed components of our network, we conducted an ablation study with regard to the attention mechanism and the focal Tversky loss function (FTL); the results are shown in Table 1. The baseline U-Net trained with DL achieves a Dice score, sensitivity and specificity of 61.0, 61.3 and 98.3, respectively. Using the focal Tversky loss helps the network focus more on the false negative voxels, which yields a better Dice score and sensitivity (66.7 and 74.0). Table 1 also shows that our proposed AU-Net improves on the baseline U-Net: integrating the attention mechanism into the segmentation network raises the Dice score and sensitivity to 68.5 and 71.1 when trained with DL. The main reason is that the attention mechanism helps to emphasize the most important feature representations for segmentation. In addition, the AU-Net model trained with FTL combines the benefits of the attention mechanism and the FTL and achieves the best Dice score (69.1) and sensitivity (81.1), at the cost of a slightly lower specificity (97.2).

| Model | Parameters | Dice | Sensitivity | Specificity |
|---|---|---|---|---|
| U-Net + DL | | 61.0 | 61.3 | 98.3* |
| U-Net + FTL | | 66.7 | 74.0 | 97.3 |
| AU-Net + DL | | 68.5 | 71.1 | 98.1 |
| AU-Net + FTL | | 69.1* | 81.1* | 97.2 |

Table 1: Comparison of different methods on the COVID-19 CT segmentation dataset. AU-Net denotes our proposed attention mechanism based network; an asterisk (*) marks the best score in each column.
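
The gains over the baseline reported in Table 1 can be made explicit with a few lines of arithmetic. The sketch below only re-uses the numbers from the table; the dictionary layout and variable names are illustrative.

```python
# Scores from Table 1: (Dice, Sensitivity, Specificity).
results = {
    "U-Net + DL":   (61.0, 61.3, 98.3),
    "U-Net + FTL":  (66.7, 74.0, 97.3),
    "AU-Net + DL":  (68.5, 71.1, 98.1),
    "AU-Net + FTL": (69.1, 81.1, 97.2),
}

baseline = results["U-Net + DL"]
deltas = {
    name: tuple(round(s - b, 1) for s, b in zip(scores, baseline))
    for name, scores in results.items()
}
for name, delta in deltas.items():
    print(name, delta)
```

Relative to the baseline, the full model gains 8.1 points in Dice and 19.8 points in sensitivity while losing 1.1 points in specificity.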

4.2 Qualitative Analysis

In order to evaluate the effectiveness of our model, we randomly select several examples from the COVID-19 CT segmentation dataset and visualize the results in Fig. 4. From Fig. 4, we can observe that the baseline U-Net trained with DL gives only a rough segmentation result. With the attention mechanism and the focal Tversky loss, the network segments more of the small lesion regions. In addition, the AU-Net trained with FTL achieves the result closest to the ground truth. These results demonstrate that leveraging the attention mechanism and the FTL can generally enhance the COVID-19 segmentation performance.

Figure 4: Examples of the segmentation results on the COVID-19 CT segmentation dataset. U-DL denotes the baseline U-Net trained with Dice loss, AU-DL denotes the proposed Attention U-Net trained with Dice loss, U-FTL denotes the baseline U-Net trained with focal Tversky loss, and AU-FTL denotes the proposed Attention U-Net trained with focal Tversky loss.

5 Conclusion

In this paper, we have presented a U-Net based segmentation network using an attention mechanism. Most current segmentation networks are trained with Dice loss, which penalizes false negative and false positive voxels equally, leading to high specificity but low sensitivity. To address this, we applied the focal Tversky loss to train the model and improve the segmentation of small ROIs. Moreover, we improved the baseline U-Net by incorporating the attention mechanism in each layer to capture rich contextual relationships for better feature representations. The experimental results demonstrate the effectiveness of our proposed method. However, the study is limited by the small dataset; we believe that with a larger training dataset, our proposed method can achieve more competitive results.


References

  • [1] N. Abraham and N. M. Khan (2019) A novel focal Tversky loss function with improved attention U-Net for lesion segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pp. 683–687.
  • [2] O. Gozes, M. Frid-Adar, H. Greenspan, P. D. Browning, H. Zhang, W. Ji, A. Bernheim, and E. Siegel (2020) Rapid AI development cycle for the coronavirus (COVID-19) pandemic: initial results for automated detection & patient monitoring using deep learning CT image analysis. arXiv preprint arXiv:2003.05037.
  • [3] F. Isensee, P. Kickingereder, W. Wick, M. Bendszus, and K. H. Maier-Hein (2017) Brain tumor segmentation and radiomics survival prediction: contribution to the BraTS 2017 challenge. In International MICCAI Brainlesion Workshop, pp. 287–297.
  • [4] S. Jin, B. Wang, H. Xu, C. Luo, L. Wei, W. Zhao, X. Hou, W. Ma, Z. Xu, Z. Zheng, et al. (2020) AI-assisted CT imaging analysis for COVID-19 screening: building and deploying a medical AI system in four weeks. medRxiv.
  • [5] L. Li, L. Qin, Z. Xu, Y. Yin, X. Wang, B. Kong, J. Bai, Y. Lu, Z. Fang, Q. Song, et al. (2020) Artificial intelligence distinguishes COVID-19 from community acquired pneumonia on chest CT. Radiology, pp. 200905.
  • [6] F. Pan, T. Ye, P. Sun, S. Gui, B. Liang, L. Li, D. Zheng, J. Wang, R. L. Hesketh, L. Yang, et al. (2020) Time course of lung changes on chest CT during recovery from 2019 novel coronavirus (COVID-19) pneumonia. Radiology, pp. 200370.
  • [7] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
  • [8] F. Shi, J. Wang, J. Shi, Z. Wu, Q. Wang, Z. Tang, K. He, Y. Shi, and D. Shen (2020) Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. arXiv preprint arXiv:2004.02731.
  • [9] A. Tversky (1977) Features of similarity. Psychological Review 84 (4), pp. 327.
  • [10] J. T. Wu, K. Leung, and G. M. Leung (2020) Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study. The Lancet 395 (10225), pp. 689–697.
  • [11] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  • [12] C. Zheng, X. Deng, Q. Fu, Q. Zhou, J. Feng, H. Ma, W. Liu, and X. Wang (2020) Deep learning-based detection for COVID-19 from chest CT using weak label. medRxiv.
  • [13] T. Zhou, S. Ruan, and S. Canu (2020) A multi-modal fusion network based on attention mechanism for brain tumor segmentation. In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI 2020).