The recent COVID-19 pandemic has disrupted life throughout the world. An effective characterization of its lesions on the lung, which vary in size and texture, holds promise for early detection and improved prognosis. Artificial Intelligence (AI) can be used to detect these abnormalities and extract textural features corresponding to virus-specific markers, thereby enabling faster detection and analysis of the infection level. Comparing the segmentation performance of U-Net [ronneberger2015u] and SegNet [saood2021covid]
demonstrates the efficacy of deep learning for identifying target COVID-19 lesions. Goncharov et al. [goncharov2021ct] used a multi-task framework to simultaneously segment the lesions and then classify patients based on the percentage of infected lung regions; however, it resulted in a lower Dice Score Coefficient (DSC). An attention mechanism was introduced [oktay2018attention] to highlight only the relevant activations during training; this reduced the computational resources wasted on irrelevant activations and improved network generalization. Deep supervision [lee2015deeply], applied in the hidden layers of a deep network, allowed learning of more discriminative features while overcoming the vanishing gradient problem.
The D2A Net [zhao2021d2a] incorporates attention modules into the basic U-Net architecture to boost performance, but yields a low recall score. This indicates the presence of False Negatives (FN), which is highly undesirable in the context of medical image segmentation. A pair of U-Nets was coupled [xie2021duda] to first extract the lung region and then emphasize the COVID-19 lesions. However, the large number of parameters, in conjunction with limited data, could lead to overfitting.
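The emphasis on recall can be made concrete: for binary masks, recall penalizes exactly the missed lesion pixels (False Negatives), while the DSC balances them against False Positives. The following is a minimal illustrative sketch of the two metrics, not the paper's evaluation code:

```python
import numpy as np

def recall(pred, gt):
    """Recall = TP / (TP + FN); FN are lesion pixels the model missed,
    which is the critical failure mode in medical segmentation."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return tp / (tp + fn) if (tp + fn) else 1.0

def dice(pred, gt):
    """Dice Score Coefficient = 2|A ∩ B| / (|A| + |B|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom else 1.0

gt   = np.zeros((8, 8), dtype=int); gt[2:6, 2:6] = 1    # 16 lesion pixels
pred = np.zeros((8, 8), dtype=int); pred[2:6, 2:4] = 1  # misses half of them
r, d = recall(pred, gt), dice(pred, gt)
```

Here a prediction that misses half the lesion scores recall 0.5, showing how under-segmentation is directly reflected in this metric.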
Early-stage pathologies on CT scans, such as ground-glass opacities (GGOs), have low contrast and blurred edges, with varying shapes and sizes. It is also difficult to precisely segment a COVID-affected region of interest, owing to possible interference from neighboring structures (such as the heart and bronchi). Tackling these challenges requires an advanced architecture that learns generalized patterns of lesions exhibiting such variations in shape, size and intensity.
This research improves on the traditional encoder-decoder framework by leveraging features captured at various levels of the encoding path. These are fed to a novel attention module, which reweights the input feature map volume to place greater emphasis on the relevant activations pertaining to the target regions. Such use of multi-scalar features to modulate the weights of the attention mechanism helps capture both the coarse, semantically rich, global details and the fine-grained spatial information of the target locality. This enables learning of patterns corresponding to irregularly structured target lesions, mainly due to the involvement of multi-scalar information at all levels; we term this "full-scale" in our nomenclature. Assigning larger weights to the feature maps of interest within the entire input volume, along with highlighting relevant locations within them (while suppressing the rest), directs attention to the pertinent details in the target lesion region and enhances the quality of segmentation. In the following sections, we describe the proposed model FuDSA-Net, followed by a demonstration of its effectiveness on publicly available COVID-infection data from lung CT.
2 Network Architecture and Attention Mechanism
Here we outline the architecture of the proposed Full-scale Deeply Supervised Attention Network (FuDSA-Net) and describe its novel attention mechanism.
The standard encoder-decoder framework for medical image segmentation [ronneberger2015u] consists of skip connections at various levels. These combine the resultant feature map volume of the encoder with the input at the corresponding level of the decoder, to retain the contextual spatial information for improved localization of the target regions in the final segmented map. However, inaccurate representations often get carried over from the lower layers [oktay2018attention], particularly as the stronger features are learned while the activation maps propagate deeper into the network. As a result, the segmentation is often inaccurate, leading to poor recall scores.
To overcome this limitation, we introduce attention modules at various tiers to inhibit the activations of irrelevant regions. This mechanism acts as a refinement of the encoder volume before fusion with the decoder volume. The new attention scheme first locates the relevant feature maps across the entire input encoder volume, followed by the detection of important zones within them. A combination of spatial and channel attention is employed for the purpose. Generating weights and multiplying them by the feature maps of the encoding path dampens the activation in unimportant areas while amplifying the relevant responses. Improved delineation of target lesions, encompassing varying shapes and sizes, is achieved by incorporating multi-scalar feature maps from all levels of the encoder while evaluating the necessary weights.
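The recalibration step can be sketched at a shape level as follows. This is a minimal NumPy illustration, not the paper's implementation: the networks that produce the attention logits are replaced by given arrays, and only the final reweighting of the encoder volume is shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reweight(feats, ch_logits, sp_logits):
    """Recalibrate an encoder volume of shape (C, H, W): per-channel
    weights select the relevant feature maps, while a spatial map
    highlights locations within them. Both are multiplied into the
    features before fusion with the decoder volume."""
    C, H, W = feats.shape
    ch_w = sigmoid(ch_logits).reshape(C, 1, 1)   # channel attention, in (0, 1)
    sp_w = sigmoid(sp_logits).reshape(1, H, W)   # spatial attention, in (0, 1)
    return feats * ch_w * sp_w

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 6, 6))
out = reweight(feats, rng.standard_normal(4), rng.standard_normal((6, 6)))
```

Because both weight tensors lie in (0, 1), the module can only dampen or preserve responses, never amplify them beyond the original magnitude; the "amplification" of relevant regions is relative to the suppressed background.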
The proposed Full-scale Deeply Supervised Attention Network (FuDSA-Net), depicted in Fig. 1, is built on the standard encoder-decoder framework of the U-Net. The skip connections are redesigned by incorporating the novel attention module at each level. Additionally, the connections within the decoder arm are restructured to enable each layer to receive accumulated knowledge from all its preceding blocks. This mechanism promotes enhanced feature propagation along the entire decoding path, with minimal loss of information. Deep supervision, applied throughout the decoder, helps learn discriminative patterns at all transitional levels.
Fig. 2 illustrates how the attention module generates the recalibrated output volume, at the third level of the network, from the encoder volume. The input to the attention module at a given level is a collection of activation maps from all preceding levels of the encoder arm, encompassing full multi-scalar information, together with the output volume of the decoder from the level below. To maintain consistency across the input response maps from the encoder, over multiple scales, we perform convolutions on the higher-resolution maps, reducing their spatial and channel dimensions until they match those of the lower-resolution maps. The decoder volume also undergoes a point-wise convolution to reduce its number of channels.
The maps are element-wise added along the channel attention branch to produce an intermediate resultant volume. Next, a block of several stacked dilated convolution (SDC) kernels, having varying dilation rates, is applied to this volume. This provides multi-scalar receptive fields, with the benefit of learning at various resolutions, by widening the effective kernel size without increasing the total number of parameters. Global Average Pooling (GAP) condenses the global information present in each map of the resultant volume into a tensor. This is passed through a multi-layer perceptron (MLP), with ReLU and sigmoid activation functions, to generate the final weight tensor, which is point-wise multiplied with the encoder maps to yield the recalibrated channel-attention output.
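The channel attention branch (point-wise convolution for channel reduction, GAP, then a two-layer MLP with ReLU and sigmoid) can be sketched in NumPy as below. The SDC block is omitted, all weights are random placeholders, and the layer sizes are arbitrary, so this is a shape-level illustration under stated assumptions, not the paper's code:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))
def relu(x): return np.maximum(x, 0.0)

def pointwise_conv(x, w):
    """1x1 convolution as pure channel mixing:
    (Cin, H, W) combined with (Cout, Cin) -> (Cout, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def channel_weights(volume, w1, b1, w2, b2):
    """GAP condenses each feature map to a scalar; a two-layer MLP
    (ReLU hidden layer, sigmoid output) turns the pooled vector into
    one weight in (0, 1) per channel."""
    pooled = volume.mean(axis=(1, 2))       # Global Average Pooling -> (C,)
    hidden = relu(w1 @ pooled + b1)
    return sigmoid(w2 @ hidden + b2)        # per-channel attention weights

rng = np.random.default_rng(1)
volume = rng.standard_normal((4, 6, 6))                           # (C, H, W)
reduced = pointwise_conv(volume, rng.standard_normal((2, 4)))     # 4 -> 2 channels
weights = channel_weights(volume, rng.standard_normal((2, 4)), np.zeros(2),
                          rng.standard_normal((4, 2)), np.zeros(4))
```

The sigmoid output keeps each channel weight in (0, 1), so multiplying the encoder maps by these weights selectively suppresses uninformative channels.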
Let us now consider the spatial attention component. Here the input maps are pixel-wise added, followed by a convolution to learn features from the intermediate outputs. The resultant volume is then convolved with another set of kernels, followed by an activation and upsampling operation, to generate the spatial attention map.
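A minimal sketch of the spatial branch follows; the single convolution kernel, the sigmoid activation, and the nearest-neighbour upsampling by a factor of 2 are all assumptions made for illustration, since the exact kernel sizes and upsampling scheme are not specified here:

```python
import numpy as np

def sigmoid(x): return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(maps, kernel):
    """Pixel-wise sum of the input maps, one same-padded convolution
    collapsing them to a single response map, sigmoid activation,
    then nearest-neighbour upsampling by 2 to match the encoder
    resolution. Output values lie in (0, 1), one per spatial location."""
    summed = maps.sum(axis=0)                       # pixel-wise addition -> (H, W)
    k = kernel.shape[0]
    padded = np.pad(summed, k // 2)                 # same padding
    H, W = summed.shape
    resp = np.empty((H, W))
    for i in range(H):                              # naive sliding-window conv
        for j in range(W):
            resp[i, j] = (padded[i:i + k, j:j + k] * kernel).sum()
    att = sigmoid(resp)
    return att.repeat(2, axis=0).repeat(2, axis=1)  # upsample x2

rng = np.random.default_rng(2)
maps = rng.standard_normal((3, 4, 4))
att = spatial_attention(maps, np.ones((3, 3)) / 9.0)
```

In the full network this map would be broadcast-multiplied with the channel-reweighted encoder volume, so each pixel's contribution is scaled by its spatial relevance.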
3 Experimental Results
The model was trained using 2D CT slices from three publicly available datasets [MedSeg2021], [jun2020covid] and [morozov2020mosmeddata]. Preprocessing resized all slices to 512 × 512 pixels. The voxel intensities of all CT volumes, from the three data sources, were clipped to the range [−1000 HU, 170 HU] to filter out unnecessary details and noise. The inclusion criterion considered only those CT slices containing lesions. Combining such extracted (multi-source) slices into a single set exposed the model to greater variability and improved generalization, particularly in differentiating between the various COVID-19 lesion structures and appearances corresponding to different severity levels. Intensity normalization was performed on the multi-source combined dataset. It was then randomly divided into 80% for training and 20% for evaluating the generalization performance.
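The preprocessing steps can be sketched as below. The lower clipping bound of −1000 HU (the HU value of air) is an assumption about the garbled range in the source text, and the seeded random split is only illustrative:

```python
import numpy as np

HU_MIN, HU_MAX = -1000, 170   # assumed clipping window (air to soft tissue)

def preprocess(volume):
    """Clip voxel intensities to the HU window, then min-max normalise
    to [0, 1] so slices from different scanners share one intensity scale."""
    clipped = np.clip(volume.astype(float), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)

def split_80_20(n_slices, seed=0):
    """Random 80/20 train/test split over slice indices."""
    idx = np.random.default_rng(seed).permutation(n_slices)
    cut = int(0.8 * n_slices)
    return idx[:cut], idx[cut:]

vol = np.array([[-2000.0, -1000.0], [170.0, 500.0]])
norm = preprocess(vol)            # extreme values saturate at 0 and 1
train_idx, test_idx = split_80_20(10)
```

Clipping before normalization ensures that outlier intensities (e.g. metal artifacts above the window) cannot compress the dynamic range of the lesion-relevant values.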
To examine the role of the constituent components of FuDSA-Net, we performed ablation studies involving three variants: (i) FuDSA-Net-I, with the spatial attention branch only; (ii) FuDSA-Net-II, with no deep supervision; and (iii) FuDSA-Net-III, excluding the residual connections between the various stages of the decoder. Experimental results for each of these variants are summarized in Table 1, with the best results marked in bold. It is observed that incorporating channel attention in FuDSA-Net significantly improves its performance over FuDSA-Net-I, as quantified by a sizeable gain across the metrics. This suggests that an enhanced attention mechanism is necessary to capture the complicated lesion regions of COVID-19. Improvement is also observed through the involvement of deep supervision, as well as the incorporation of intra-stage connections in the decoder arm.
A comparison of the proposed FuDSA-Net was also made with baseline architectures, viz. U-Net [ronneberger2015u], U-Net++ [zhou2018unet++], Attention U-Net [oktay2018attention], and Residual U-Net [khanna2020deep], in terms of the different metrics. It is evident from Table 2 that FuDSA-Net outperforms them by a significant margin of around 8%, as compared to the second best-performing model, U-Net++. A higher value of this metric signifies fewer False Negative pixels in the generated output. FuDSA-Net also shows a noticeable gain in terms of the remaining metrics.
The resultant segmentation map of FuDSA-Net is observed to be relatively more accurate and closer to the corresponding ground truth, as illustrated in Fig. 3.
A novel deep learning architecture, FuDSA-Net, has been designed to effectively segment the COVID-19 lesions from lung CT scans. Multi-scalar features were acquired from all stages of the encoding path, for improved modeling of lesions having varying shapes, sizes and intensities. The attention mechanism was unique, incorporating both channel and spatial attention for generating weights. These were used for recalibrating the encoder map volume, prior to concatenation with the input volume at the decoder arm, to minimize unimportant activation from the input. Additional deep supervision enabled direct monitoring of intermediate layers in the decoding pathway. The vanishing gradient problem was avoided through the residual connections added all along the upsampling path. The experimental results demonstrated the superiority of FuDSA-Net, as compared to state-of-the-art methods, in identifying challenging target regions.