
Full-scale Deeply Supervised Attention Network for Segmenting COVID-19 Lesions

by Pallabi Dutta, et al.

Automated delineation of COVID-19 lesions from lung CT scans aids the diagnosis and prognosis for patients. The asymmetric shapes and positioning of the infected regions make the task extremely difficult. Capturing information at multiple scales will assist in deciphering features, at global and local levels, to encompass lesions of variable size and texture. We introduce the Full-scale Deeply Supervised Attention Network (FuDSA-Net), for efficient segmentation of corona-infected lung areas in CT images. The model considers activation responses from all levels of the encoding path, encompassing multi-scalar features acquired at different levels of the network. This helps segment target regions (lesions) of varying shape, size and contrast. Incorporation of the entire gamut of multi-scalar characteristics into the novel attention mechanism helps prioritize the selection of activation responses and locations containing useful information. Determining robust and discriminatory features along the decoder path is facilitated with deep supervision. Connections in the decoder arm are remodeled to handle the issue of vanishing gradient. As observed from the experimental results, FuDSA-Net surpasses other state-of-the-art architectures; especially, when it comes to characterizing complicated geometries of the lesions.



1 Introduction

The recent COVID-19 pandemic has disrupted life throughout the world. An effective characterization of its lesions on the lung, involving variable size and texture, holds promise in its early detection and improved prognosis. Artificial Intelligence (AI) can be effectively used to detect these abnormalities, and extract textural features corresponding to the virus-specific markers; thereby resulting in faster detection and analysis of the infection level. Comparing the segmentation performance of U-Net [ronneberger2015u] and SegNet [saood2021covid] demonstrates the efficacy of deep learning for identifying target COVID-19 lesions. Goncharov et al. [goncharov2021ct] used a multi-task framework to simultaneously segment the lesions, followed by the classification of the patients based on the percentage of infected lung regions. However, it resulted in a lower Dice Score Coefficient (DSC). An attention mechanism was introduced [oktay2018attention] as a way to highlight only relevant activations during training. This reduced computational resources wasted on irrelevant activations and improved network generalization. Deep supervision [lee2015deeply], in the hidden layers of a deep network, allowed learning of more discriminative features while overcoming the vanishing gradient problem.

The D2A Net [zhao2021d2a] consists of attention modules incorporated into the basic U-Net architecture to boost performance, but yields a low recall score. This is indicative of the presence of False Negatives (FN), which is highly undesirable in the context of medical image segmentation. A pair of U-Nets were coupled [xie2021duda] to initially extract the lung region and then emphasize the COVID-19 lesions. However, the large number of parameters, in conjunction with limited data, could lead to overfitting.

Early-stage pathologies on CT scans (like GGOs) have low contrast and blurred edges, with varying shapes and sizes. It also becomes difficult to precisely segment a COVID-affected region of interest, due to possible interference from neighboring regions (such as the heart and bronchi). Tackling these challenges requires an advanced architecture to learn generalized patterns of the lesions exhibiting such variations in terms of shape, size and intensity.

This research improves on the traditional encoder-decoder framework by leveraging features captured at various levels of the encoding path. These are fed to a novel attention module, which reweights the input feature map volume to place greater emphasis on the relevant activations pertaining to the target regions. Such use of multi-scalar features to modulate the weights of the attention mechanism helps capture both coarse, semantically-rich, global details along with the fine-grained, spatial information of the target locality. This enables learning of patterns corresponding to irregularly structured target lesions; mainly due to the involvement of multi-scalar information at all levels. This is termed as “full-scale” in our nomenclature. Assignment of larger weights to the feature maps of interest within the entire input volume, along with highlighting relevant locations within them (while suppressing the rest) directs attention to the pertinent details in the target lesion region. This helps enhance the quality of segmentation. In the following sections, we describe the proposed model FuDSA-Net, followed by a demonstration of its effectiveness on publicly-available COVID-infection data from lung CT.

2 Network Architecture and Attention Mechanism

Here we outline the architecture of the proposed Full-scale Deeply Supervised Attention Network (FuDSA-Net) and describe its novel attention mechanism.

The standard encoder-decoder framework for medical image segmentation tasks [ronneberger2015u] consists of skip connections at various levels. These combine the resultant feature map volume of the encoder with the input at the corresponding level of the decoder, to retain the contextual spatial information for improved localization of the target regions in the final segmented map. However, inaccurate representations often get carried over from the lower layers [oktay2018attention]; particularly, as the stronger features are learned with the activation maps propagating deeper into the network. As a result, the segmentation is often inaccurate, leading to poor recall scores.

Figure 1: FuDSA-Net for COVID-19 lesion segmentation.

To overcome this limitation, we introduce attention modules at various tiers to inhibit the activation(s) of the irrelevant region(s). This mechanism acts as a refinement for the encoder volume, before fusing with the decoder volume. The new attention scheme first locates the relevant feature maps across the entire input encoder volume, followed by the detection of important zones within them. A combination of spatial and channel attention is employed for the purpose. Generating weights and multiplying them by the feature maps of the encoding path dampens the activation in unimportant areas while amplifying the relevant response(s). Improved delineation of target lesions, encompassing varying shapes and sizes, is achieved through the incorporation of multi-scalar feature maps from all levels of the encoder while evaluating the necessary weights.

The proposed Full-scale Deeply Supervised Attention Network (FuDSA-Net), depicted in Fig. 1, is built on the standard encoder-decoder framework of the U-Net. The skip connections are redesigned by incorporating the novel attention module at each level. Additionally, the connections within the decoder arm are restructured to enable each layer to receive accumulated knowledge from all its preceding blocks. This mechanism promotes enhanced feature propagation along the entire decoding path, with minimal loss of information. Deep supervision, applied throughout the decoder, helps learn discriminative patterns at all transitional levels.
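As a rough illustration of deep supervision, the sketch below sums a loss over several decoder side outputs, each compared against the ground-truth mask at its own resolution. Everything specific here is an assumption made for illustration: the function names, the soft Dice loss used as the per-level criterion, the nearest-neighbour subsampling of the mask, and the per-level weights are not taken from the FuDSA-Net description.

```python
def downsample_mask(mask, factor):
    """Nearest-neighbour subsampling of a 2D mask by an integer factor."""
    return [row[::factor] for row in mask[::factor]]


def soft_dice_loss(pred, target, smooth=1e-6):
    """1 - soft Dice overlap between two equally sized 2D maps."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    overlap = sum(p * t for p, t in zip(flat_p, flat_t))
    return 1.0 - (2.0 * overlap + smooth) / (sum(flat_p) + sum(flat_t) + smooth)


def deep_supervision_loss(side_outputs, mask, level_weights):
    """Weighted sum of per-level losses (weights are hypothetical)."""
    total = 0.0
    for pred, weight in zip(side_outputs, level_weights):
        factor = len(mask) // len(pred)  # resolution ratio of mask to side output
        total += weight * soft_dice_loss(pred, downsample_mask(mask, factor))
    return total


mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
full_res = [row[:] for row in mask]   # side output at full resolution
half_res = [[1, 0], [0, 0]]           # side output at half resolution
loss = deep_supervision_loss([full_res, half_res], mask, [1.0, 0.5])
```

Because every intermediate level contributes its own term, gradients reach the earlier decoder blocks directly rather than only through the final output.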

Figure 2: Illustration of the working of the attention module, to produce a weighted encoder activation map volume at the third level of FuDSA-Net.

Fig. 2 illustrates how the attention module generates the recalibrated output volume from the encoder volume at the third level of the network. The input to the attention module at a given level is a collection of activation maps from all preceding levels of the encoder arm, encompassing full multi-scalar information, together with an output volume of the decoder. To maintain consistency across the input response maps from the encoder, over multiple scales, we perform a convolution on the high-resolution maps; thereby reducing their spatial and channel dimensions until they are of the same dimensions as the lower-resolution maps. The decoder volume also undergoes a point-wise convolution to shrink its number of channels, yielding the reduced decoder volume.

The maps are added element-wise, along the channel attention branch, to produce an intermediate resultant volume.


Next, a block of several stacked dilated convolution (SDC) kernels, having varying dilation rates, is applied to this intermediate volume. This allows multi-scalar receptive fields, with the benefit of learning at various resolutions, by widening the kernel size without increasing the total number of parameters. Global Average Pooling (GAP) condenses the global information present in each map of the resultant volume into a tensor. This is passed through a multi-layer perceptron (MLP), with ReLU and sigmoid activation functions, to generate the final weight tensor. The weight tensor is point-wise multiplied with the encoder map at the corresponding level to yield the channel-recalibrated volume.
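The pooling and reweighting steps of the channel branch can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the FuDSA-Net implementation: the SDC block is omitted, the MLP is taken to be two dense layers with ReLU on the hidden layer and sigmoid on the output, and all function names and weight values are hypothetical.

```python
import math


def global_average_pool(volume):
    """Mean of each channel; volume is indexed [channel][row][col]."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in volume]


def dense(vector, weights, biases):
    """Fully connected layer; weights[j][i] links input i to output j."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-x)) for x in vector]


def channel_attention(encoder_volume, fused_volume, w1, b1, w2, b2):
    """Reweight each encoder channel by a weight derived from fused_volume."""
    squeezed = global_average_pool(fused_volume)       # one scalar per channel
    hidden = relu(dense(squeezed, w1, b1))             # assumed hidden layer
    channel_weights = sigmoid(dense(hidden, w2, b2))   # weights in (0, 1)
    reweighted = [[[wt * v for v in row] for row in fmap]
                  for fmap, wt in zip(encoder_volume, channel_weights)]
    return reweighted, channel_weights


encoder = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
fused = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[4.0, 0.0], [0.0, 4.0]], [0.0, -4.0]
reweighted, weights = channel_attention(encoder, fused, w1, b1, w2, b2)
```

With these toy weights the first channel is kept almost unchanged while the second is strongly suppressed, which is the intended effect of prioritizing informative feature maps.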


Let us now consider the spatial attention component. Here the input maps are added pixel-wise, followed by a convolution to learn features from the intermediate outputs. Next, the resultant volume is convolved with a further set of kernels, followed by an activation and an upsampling operation, to generate the spatial attention map.
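The sequence of operations in the spatial branch can be sketched in plain Python as follows. The sketch makes several assumptions that are not stated in the text: the second convolution is modelled as a point-wise (1 x 1) convolution collapsing the channels into one map, the activation is a sigmoid, upsampling is nearest-neighbour by a factor of two, and the intermediate feature-learning convolution is omitted. All function names are hypothetical.

```python
import math


def add_volumes(volumes):
    """Pixel-wise sum of equally shaped volumes indexed [channel][row][col]."""
    first = volumes[0]
    return [[[sum(v[c][r][k] for v in volumes)
              for k in range(len(first[0][0]))]
             for r in range(len(first[0]))]
            for c in range(len(first))]


def pointwise_conv_to_one_channel(volume, channel_weights, bias=0.0):
    """1 x 1 convolution that mixes all channels into a single 2D map."""
    rows, cols = len(volume[0]), len(volume[0][0])
    return [[sum(w * volume[c][r][k] for c, w in enumerate(channel_weights)) + bias
             for k in range(cols)]
            for r in range(rows)]


def sigmoid_map(fmap):
    return [[1.0 / (1.0 + math.exp(-x)) for x in row] for row in fmap]


def upsample_nearest(fmap, factor):
    """Repeat every pixel `factor` times along both axes."""
    out = []
    for row in fmap:
        wide = [x for x in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


def spatial_attention_map(volumes, channel_weights, factor=2):
    fused = add_volumes(volumes)
    logits = pointwise_conv_to_one_channel(fused, channel_weights)
    return upsample_nearest(sigmoid_map(logits), factor)


encoder_part = [[[2.0, -2.0]]]   # one channel, 1 x 2 map
decoder_part = [[[2.0, -2.0]]]
attention = spatial_attention_map([encoder_part, decoder_part], [1.0])
```

The resulting map has one value per pixel in the range (0, 1); locations with strong combined responses receive weights close to one, and the rest are damped.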


The final output of the attention module is generated as a combination of the channel and spatial attention components of eqns. (3) and (4).


3 Experimental Results

The model was trained using 2D CT slices from three publicly available datasets [MedSeg2021], [jun2020covid] and [morozov2020mosmeddata]. Preprocessing resized all slices to 512 × 512. The voxel intensities of all CT volumes, from the three data sources, were clipped to the range [-1000 HU, 170 HU] to filter out unnecessary details and noise. The inclusion criterion considered only those CT slices containing lesions. Combining the extracted (multi-source) slices into a single set for presentation to the network exposed the model to greater variety, for improved generalization; particularly, in differentiating between various COVID-19 lesion structures and appearances corresponding to different severity levels. Intensity normalization was performed on the multi-source combined dataset. It was then randomly divided into 80% for training and 20% for evaluating the generalization performance.
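The intensity clipping, normalization and random split described above can be sketched as follows. This is a hedged illustration only: it assumes the lower bound of the clipping window is -1000 HU (air), it assumes min-max scaling as the normalization (the text does not name the scheme), and the function names and random seed are hypothetical.

```python
import random

HU_MIN, HU_MAX = -1000.0, 170.0  # assumed clipping window in Hounsfield units


def clip_and_normalize(slice_hu, lo=HU_MIN, hi=HU_MAX):
    """Clip a 2D slice to [lo, hi] HU, then rescale it to [0, 1] (assumed min-max)."""
    return [[(min(max(v, lo), hi) - lo) / (hi - lo) for v in row]
            for row in slice_hu]


def train_test_split(items, train_fraction=0.8, seed=0):
    """Randomly split items into training and evaluation subsets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


normalized = clip_and_normalize([[-2000.0, -1000.0, -415.0, 170.0, 500.0]])
train_ids, test_ids = train_test_split(range(10))
```

Values below the window collapse to 0 and values above it collapse to 1, so dense structures such as bone no longer dominate the intensity range seen by the network.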

FuDSA-Net is developed using TensorFlow and Keras in Python 3.9. All experiments are performed on a 12GB NVIDIA GeForce RTX 2080 Ti GPU. The focal Tversky loss function [abraham2019novel] is used for training along with the Adam optimizer. A learning rate was set for the optimizer, and an early stopping mechanism was employed to prevent overfitting. The segmentation performance is determined by reporting the values of the Dice Score Coefficient (DSC), Intersection over Union (IoU) and Recall metrics.
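For reference, a minimal plain-Python sketch of a focal Tversky loss on flattened probability maps is given below. The hyperparameter values (alpha = 0.7 for false negatives, beta = 0.3 for false positives, gamma = 4/3) and the smoothing constant are illustrative assumptions; the settings actually used for FuDSA-Net are not stated in the text.

```python
def tversky_index(pred, target, alpha=0.7, beta=0.3, smooth=1e-6):
    """Tversky index on flattened maps; alpha weights FN, beta weights FP."""
    tp = sum(p * t for p, t in zip(pred, target))
    fn = sum((1.0 - p) * t for p, t in zip(pred, target))
    fp = sum(p * (1.0 - t) for p, t in zip(pred, target))
    return (tp + smooth) / (tp + alpha * fn + beta * fp + smooth)


def focal_tversky_loss(pred, target, alpha=0.7, beta=0.3, gamma=4.0 / 3.0):
    """(1 - Tversky index) raised to the power 1 / gamma."""
    ti = tversky_index(pred, target, alpha, beta)
    return max(0.0, 1.0 - ti) ** (1.0 / gamma)


target = [1.0, 1.0, 0.0, 0.0]
missed_lesion = [1.0, 0.0, 0.0, 0.0]   # one false negative
extra_region = [1.0, 1.0, 1.0, 0.0]    # one false positive
loss_fn = focal_tversky_loss(missed_lesion, target)
loss_fp = focal_tversky_loss(extra_region, target)
```

With alpha larger than beta, a missed lesion pixel costs more than a spurious one, which matches the emphasis on reducing False Negatives in medical image segmentation.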

Model            DSC     Recall  IoU
FuDSA-Net        0.7924  0.8104  0.6681
FuDSA-Net-I      0.7424  0.7046  0.6074
FuDSA-Net-II     0.7639  0.7246  0.6308
FuDSA-Net-III    0.7905  0.7709  0.6650
Table 1: Ablation study on FuDSA-Net

To examine the role of the constituent components of FuDSA-Net, we performed ablation studies involving three variants. These are (i) FuDSA-Net-I, with the spatial attention branch only; (ii) FuDSA-Net-II, with no deep supervision; and (iii) FuDSA-Net-III, excluding the residual connections between the various stages of the decoder. Experimental results for each of these variants are summarised in Table 1. The best result in every column is obtained by the full FuDSA-Net. It is observed that incorporating channel attention in FuDSA-Net significantly improves its performance over FuDSA-Net-I, as quantified by sizeable gains in all three metrics, most notably Recall (0.8104 against 0.7046). This suggests that an enhanced attention mechanism is necessary in order to capture the complicated lesion regions of COVID-19. Improvement is also observed through the involvement of deep supervision as well as the incorporation of intra-stage connections in the decoder arm.

Model            DSC     Recall  IoU
FuDSA-Net        0.7924  0.8104  0.6681
U-Net            0.6722  0.6515  0.5315
U-Net++          0.7167  0.6889  0.5774
Attention U-Net  0.6480  0.6232  0.5071
Residual U-Net   0.6476  0.6040  0.5094
Table 2: Comparison of FuDSA-Net with baseline models

A comparison of the proposed FuDSA-Net was also made with baseline architectures, like U-Net [ronneberger2015u], U-Net++ [zhou2018unet++], Attention U-Net [oktay2018attention], and Residual U-Net [khanna2020deep], in terms of the different metrics. It is evident from Table 2 that FuDSA-Net outperforms them by a significant margin of around 8% in terms of DSC (0.7924, as compared to 0.7167 for the second best-performing model, U-Net++). The higher Recall value signifies fewer False Negative pixels in the generated output. A noticeable gain is also evident in terms of the IoU value for our FuDSA-Net.
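The three reported metrics follow from the pixel-level counts of true positives (TP), false positives (FP) and false negatives (FN). The sketch below computes them for a single pair of flattened binary masks; the function names and the convention of returning 1.0 when both masks are empty are assumptions, and the aggregation over slices used for the tables is not specified in the text.

```python
def confusion_counts(pred, target):
    """Count TP, FP and FN over two flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p and t)
    fp = sum(1 for p, t in zip(pred, target) if p and not t)
    fn = sum(1 for p, t in zip(pred, target) if not p and t)
    return tp, fp, fn


def dice_score(pred, target):
    tp, fp, fn = confusion_counts(pred, target)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0  # assumed convention for empty masks


def recall_score(pred, target):
    tp, _, fn = confusion_counts(pred, target)
    return tp / (tp + fn) if (tp + fn) else 1.0


def iou_score(pred, target):
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + fp + fn
    return tp / denom if denom else 1.0


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
scores = (dice_score(pred, truth), recall_score(pred, truth), iou_score(pred, truth))
```

Recall depends only on TP and FN, which is why a higher Recall corresponds directly to fewer missed lesion pixels.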

Figure 3: (a) Sample CT scan of COVID-19 affected patient, with (b) corresponding ground truth, along with predictions made by (c) proposed FuDSA-Net, and (d) U-Net++. The red box highlights the comparison area in each case.

The resultant segmentation map of FuDSA-Net is observed to be relatively more accurate and closer to the corresponding ground truth, as illustrated in Fig. 3.

4 Conclusion

A novel deep learning architecture, FuDSA-Net, has been designed to effectively segment the COVID-19 lesions from lung CT scans. Multi-scalar features were acquired from all stages of the encoding path, for improved modeling of lesions having varying shapes, sizes and intensities. The attention mechanism was unique, incorporating both channel and spatial attention for generating weights. These were used for recalibrating the encoder map volume, prior to concatenation with the input volume at the decoder arm, to minimize unimportant activation from the input. Additional deep supervision enabled direct monitoring of intermediate layers in the decoding pathway. The vanishing gradient problem was avoided through the residual connections added all along the upsampling path. The experimental results demonstrated the superiority of FuDSA-Net, as compared to state-of-the-art methods, in identifying challenging target regions.