The heart is one of the most essential organs in the human body. It is often called the engine of the human body because it is located at the center of the chest and supplies blood throughout the body. Therefore, even small abnormalities in the cardiac system can lead to fatal consequences. Cardiovascular diseases (CVDs) are one of the major causes of death worldwide [McNamara2019]. On average, a death due to CVDs occurs once every 37 s, which corresponds to 2,353 deaths each day in the United States [doi:10.1161/CIR.0000000000000757]. According to the World Health Organization, 17.9 million people die each year from CVDs. It has been reported that up to 90% of CVDs can be prevented [doi:10.1161/CIRCULATIONAHA.107.717033, ODONNELL2016761] by timely diagnosis of heart problems. Applications of computer-aided prediagnosis systems are rapidly developing owing to the advances in computerized imaging devices (e.g., computed tomography (CT) or magnetic resonance imaging (MRI)). Automated segmentation of cardiac CT images can aid in preventing CVDs, such as coronary artery disease, stroke, and cardiomyopathy, in presymptomatic subjects.
In the past decade, deep learning in computer vision, especially convolutional neural networks (CNNs), has achieved superior performance. CNNs have replaced conventional methods in several computer vision tasks, including image classification [simonyan2015deep, he2015deep, Huang2017], localization [ren2016faster, redmon2018yolov3], and segmentation [long2015fully]. Although it is difficult to gather and handle large-scale medical datasets, CNNs have been widely employed for a vast number of automated medical image segmentation tasks [ronneberger2015unet, milletari2016vnet, Gibson2018, oktay2018attention, chung2020liver]. Notably, Ronneberger et al. [ronneberger2015unet] successfully boosted the integration of CNNs into medical imaging. Their network, named U-Net [ronneberger2015unet], combined the skip connection and the encoder-decoder architecture to propagate low-level spatial information. In addition to the architectural advances, the Dice similarity coefficient (DSC) loss was proposed by Milletari et al. [milletari2016vnet], which improved deep learning performance by resolving the data imbalance problem between foreground and background voxels. In particular, several CNN-based deep learning algorithms have been proposed for cardiac imaging [Li2017, Yu2017, Zheng2018, KHENED201921, Du2019]. A 2.5D multislice network [Zheng2018, Du2019] was utilized for cardiac ventricle segmentation; however, the 3D spatial features were not fully exploited in the 2.5D methods, which employed multiple 2D sliced images. Automatic whole heart (WH) segmentation methods [Li2017, Yu2017] have been researched based on conventional segmentation networks (e.g., FCN and U-Net). The conventional methods showed inferior performance because they ignored the identification of accurate boundaries between cardiac substructures. The prior methods were also limited in terms of distinguishing between homogeneous intensity boundaries because of the difficulty in preserving both spatial and shape features.
In this study, we aimed to improve the performance of cardiac segmentation in CT images. Accurate cardiac segmentation is a challenging problem because the heart is composed of several substructures. Fig. 1 shows that the human heart is a complex organ consisting of four chambers, blood vessels, and muscles. The aim of this study was to precisely segment substructures, including the four chambers (left ventricle (LV), right ventricle (RV), left atrium (LA), right atrium (RA)), two arteries (ascending aorta (AA), pulmonary artery (PA)), and myocardium of the left ventricle (LV-myo). The major difficulty in segmenting cardiac suborgans is identifying the boundary between adjacent organs (e.g., the boundary between the atrium and the ventricle). To address this problem, we propose a CNN model with a novel attention mechanism (Fig. 2). To focus on the shape and boundary of cardiac structures, we incorporated the features of the distance transformation (DT) of the labeled ground-truth image and the contour of the object into the attention mechanism. Specifically, the method drives the CNN model to better learn boundary-aware features based on shape-aware features (i.e., DT). Moreover, the proposed attention mechanism can produce exact segmentation results by reducing the false-positive responses. We employed our proposed shape-aware contour attention mechanism on a conventional encoder-decoder architecture, U-Net [ronneberger2015unet].
The remainder of this paper is organized as follows. In Section II, the background, including the visual attention mechanism, shape-prior embedding, and DT, is reviewed. The proposed network architecture and its details are described in Section III. The experimental results compared with other CNNs and the ablations of our proposed network are presented in Section IV. Finally, the discussion and conclusions are presented in Section V.
II-A Visual Attention Mechanism
The attention mechanism for CNNs is widely used for image classification, localization, and segmentation [ba2015multiple, zhou2015learning, chen2016attention, wang2017residual]. The intention of the visual attention mechanism is simply to enhance the output of the receptive field around the target objects. Xu et al. [pmlr-v37-xuc15] showed through visualization how a model automatically focuses on objects in the image. Wang et al. [wang2017residual] stacked the attention modules for image classification. The Squeeze and Excitation block [hu2019squeezeandexcitation], a type of attention vector, was proposed to recalibrate the feature maps. Based on [wang2017residual, hu2019squeezeandexcitation], the bottleneck attention module (BAM) [bam] and the convolutional block attention module (CBAM) [cbam] were proposed. BAM and CBAM can be easily combined with universal CNN models and achieve better performance. The channel attention module weighs the importance of features at the channel level, and the spatial attention module encodes the spatial location of objects in the feature maps. However, BAM and CBAM are limited in extracting shape-prior features because the max and average pooling functions weaken the details of shape features by reinforcing spatial location information. To force the model to refine the detailed segmentation of object boundaries, Zhuge et al. employed contour-supervised features in the attention module. Chung et al. [chung2020liver] introduced a self-supervising contour attention to automatically identify a deep context for segmentation. However, with only a single contour feature, it is difficult to obtain the attention map in a homogeneous intensity area. In this study, to complement the contour attention, both the DT and contour features are fed to the shape-aware contour attention mechanism to reduce the responses on the background area.
II-B Shape-prior Embedding
In the biomedical image segmentation field, the visual attention mechanism was studied to extract shape-prior information from coarse features, based on the fact that most human organs, such as the liver and lung, typically form certain shapes. Based on this idea, DenseVNet was introduced by Gibson et al. [Gibson2018] to segment multiple organs. They attached auxiliary learnable parameters to capture the approximate shape of the target organs. The auxiliary shape-prior feature was concatenated to the final output, providing a hint to the model as an additive attention mechanism. Oktay et al. [oktay2018attention] conducted a self-attention mechanism focusing on shape-prior features. Based on the U-Net architecture, a soft-attention gating module was proposed to utilize spatial contextual information [oktay2018attention]. The authors implemented a self-attention mechanism by connecting high-level features that are sufficient to express the shape-prior context as gating to the attention module. These networks outperformed the other networks by using spatial information; however, a trainable volume of parameters is too coarse to express the details of the shape. Moreover, the proposed self-attention mechanism cannot be applied for delineating the boundaries between the shapes of multiple organs.
II-C Distance Transformation for Convolutional Neural Network-based Image Segmentation
The DT operation represents distances for each pixel (or voxel) in terms of a given binary image [BORGEFORS1986344]. Typically, we use the boundary of the foreground area as a set of pixels, and in the DT process, each pixel inside the foreground area is assigned its distance to the nearest boundary pixel. DT can be denoted as follows:

$$DT(p) = \min_{q \in Q} d(p, q), \tag{1}$$

where $Q$ is a predetermined point set, such as the boundary of the foreground or the background, and $p$ and $q$ are points. $d(p, q)$ indicates the function that calculates the distance between $p$ and $q$.
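As an illustration of the DT definition, the following is a minimal brute-force sketch in plain Python for a small 2D binary mask. It assumes the Euclidean distance and takes the background pixels as the reference point set; the function name `distance_transform` is hypothetical, and this quadratic-time version is not the linear-time algorithm used later for the actual experiments.

```python
import math


def distance_transform(mask):
    """Brute-force foreground DT of a 2D binary mask (list of lists of 0/1).

    Each foreground pixel receives the Euclidean distance to the nearest
    background pixel; background pixels receive 0. The choice of the
    background as the reference point set is an assumption of this sketch.
    """
    rows, cols = len(mask), len(mask[0])
    background = [(r, c) for r in range(rows) for c in range(cols)
                  if mask[r][c] == 0]
    dt = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] == 1 and background:
                dt[r][c] = min(math.hypot(r - br, c - bc)
                               for br, bc in background)
    return dt


mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
dt = distance_transform(mask)
print(dt[2][2])  # center of the square, farthest from the background
```

The values grow toward the interior of the object, which is why the DT can serve as a compact description of shape.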
Because DT can represent shape information, it has been employed for image segmentation tasks. For example, the DT and watershed algorithms were combined to segment the objects of interest. In recent years, DT has been employed in CNNs and has shown promising results. Audebert et al. [audebert2019distance] used signed DT as a regression target to detect clear boundaries and shapes. Wang et al. [doi:10.1146/annurev-bioeng-071516-044442] combined deep learning and the watershed algorithm to detect cells. Chung et al. [CHUNG2020103720] calculated the loss between tooth segmentation results and their ground-truth DT to learn similar tooth shape features. Navarro et al. [10.1007/978-3-030-32692-0_71] trained the model using labels, DT from labels, and the contour of the labeled images. For cardiac segmentation, Dangi et al. [Dangi2019] added a complementary decoder to directly regress the DT, and subsequently, the DT was used to regularize the network. The overall limitation of previous works is that they employed an auxiliary task of regressing DT; however, the shape-related features were not successfully trained by the proposed methods. More importantly, DT features have not been explicitly used to improve boundary delineation.
III-A Contour- and Distance-transform-guided Attention Network Architecture
To segment the entire heart from cardiac CT images using a deep learning technique, we designed a novel CNN named contour- and distance-transform-guided attention network (CDA-Net). The proposed architecture is illustrated in Fig. 3. The U-Net [3dunet2016] architecture is used as the backbone of our model. Using the coarse features from the backbone, auxiliary V-transition [chung2020liver] modules obtain the contour and DT features. In addition, a novel penalty energy is developed to mutually complement the contour and DT features. We further integrated these features (i.e., contour and DT) and the outputs of the backbone to feed the attention mechanism. In the following shape-aware attention module, the DT features suppress the feature responses in the background area, allowing the model to suppress false-positive responses. In contrast, the contour features force the model to focus on the detail of the object boundary. Finally, another V-transition is applied to aggregate and refine the features for multi-class segmentation results. The details of the proposed components are described in the subsequent paragraphs.
Contour and Distance Transform Transitions
Based on the backbone network, the contour transition network (CTN) and distance transform transition network (DTTN) were designed to form CDA-Net. The underlying design principles for CTN and DTTN are to predict the contour probability of each voxel and coarse shape-prior information, respectively. The role of CTN and DTTN is to refine the following shape-aware attention module, which is developed to better represent the features of cardiac structures. From the encoder part of the backbone, low-level features are fed to the CTN to generate contour probabilistic map features, and high-level features are input to the DTTN to directly regress the DT features. We utilized the low-level features for the CTN because the contour is a combination of local edge features; conversely, high-level features were used to estimate DT, which requires global shape information for accurate prediction. As shown in Fig. 3, each transition block represents a V-transition block (Fig. 4), which is a single-level encoder-decoder module with a channel-wise attention mechanism. The V-transition has advantages in learning multiscale features with only a few parameters, which can help to learn the shape of the heart from multiscale CT images [chung2020liver]. For the multiclass task (i.e., multiple suborgans), class-wise contours and DT were individually used to train the CTN and DTTN, respectively. We applied binary cross-entropy (BCE) loss and mean squared error (MSE) loss to train the outputs of CTN and DTTN, respectively. The model predicts the contour probability per voxel for each class and computes the differences between the forwarded contour feature and the ground-truth using the following equation:

$$\mathcal{L}_{C} = -\frac{1}{V}\sum_{i=1}^{V}\left[ w_1\, c_i \log \hat{c}_i + w_0\,(1 - c_i)\log\left(1 - \hat{c}_i\right) \right], \tag{2}$$

where $c_i$ and $\hat{c}_i$ are the ground-truth contour label and the predicted contour probability of voxel $i$, respectively, $V$ is the number of voxels, and $w_0$ and $w_1$ are the weights of the non-contour and contour classes, respectively.
Furthermore, because there are only a few contour voxels when compared to the background voxels, we set the weights $[w_0, w_1]$ as [0.001, 0.999] to solve the highly imbalanced classification task.
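Similarly, to train the DT feature for the DTTN, the MSE loss is employed. The MSE loss for the predicted DT map $\hat{t}$ and ground-truth $t$ with $V$ number of voxels is defined as

$$\mathcal{L}_{DT} = \frac{1}{V}\sum_{i=1}^{V}\left(t_i - \hat{t}_i\right)^2. \tag{3}$$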
In summary, two V-transitions are employed to obtain the contour and DT features from the backbone. These additional features can help the model regress the details of object boundaries by feeding it to the attention mechanism.
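The two auxiliary losses can be sketched in plain Python for a flattened list of voxels. This is only an illustrative sketch: the function names, the `eps` clipping, and the assignment of the weight 0.001 to non-contour voxels and 0.999 to contour voxels are assumptions, and an actual implementation would operate on framework tensors rather than lists.

```python
import math


def weighted_bce(pred, target, w_bg=0.001, w_fg=0.999, eps=1e-7):
    """Class-weighted binary cross-entropy averaged over voxels.

    pred: predicted contour probabilities in [0, 1].
    target: ground-truth contour labels (0 or 1).
    w_bg / w_fg: weights for non-contour / contour voxels (assumed mapping).
    """
    total = 0.0
    for p, y in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= w_fg * y * math.log(p) + w_bg * (1 - y) * math.log(1.0 - p)
    return total / len(pred)


def mse(pred, target):
    """Mean squared error between predicted and ground-truth DT values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# A missed contour voxel costs far more than a spurious contour response.
missed_contour = weighted_bce([0.1], [1])
spurious_contour = weighted_bce([0.9], [0])
print(missed_contour > spurious_contour)
print(mse([1.0, 2.0], [1.0, 4.0]))
```

With such weights, the rare contour class dominates the loss, which counteracts the tendency of the network to predict "no contour" everywhere.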
Penalty Energy to Refine Features
The outputs of the CTN and DTTN may produce a noisy contour probability map and noisy DT features, respectively. Because more precise features can build a fine attention map in the shape-aware attention module, we incorporated the penalty energy to regularize the outputs of the CTN and DTTN. To regularize the feature map responses, we first applied a sigmoid function to the contour features. Thus, each voxel feature represents the contour probability in the range of $[0, 1]$. In the case of DT features, we clamped the DT image to distinguish between the foreground and background voxels. Subsequently, we inverted the DT features by subtracting the value from 1 to flip the foreground and background. Finally, the penalty energy for regularization is defined by calculating the product of the contour features with the inverted DT features. The penalty energy is obtained as follows:

$$E_{P} = \sum_{i} \sigma(F_C)_i \left(1 - \mathrm{clamp}(F_D)_i\right), \tag{4}$$

where $\sigma$ is the sigmoid function and the function $\mathrm{clamp}(\cdot)$ maintains only the values of $F_D$ within the range $[0, 1]$; moreover, $F_C$ and $F_D$ indicate the contour and DT features, respectively. This penalty energy becomes zero if the responses of the contour features do not appear in the background area of the DT features. Fig. 5 illustrates an example of the penalty energy. Thus, it can reduce the false-positive responses from the contour probabilistic map (i.e., $\sigma(F_C)$), primarily by assigning a larger penalty in the background than in the regions proximate to the boundary.
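The steps described above (sigmoid on the contour features, clamping and inverting the DT features, then taking the product) can be sketched as follows. This is a hedged illustration in plain Python: the clamp range [0, 1], the summation over voxels, and all function names are assumptions made for the sketch.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw response to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def clamp(x, lo=0.0, hi=1.0):
    """Keep x within [lo, hi]; the range [0, 1] is an assumption."""
    return min(max(x, lo), hi)


def penalty_energy(contour_logits, dt_values):
    """Sum of contour probabilities weighted by the inverted, clamped DT.

    Voxels with a DT of 0 (background) contribute their full contour
    probability, whereas voxels with a DT of at least 1 contribute nothing.
    """
    return sum(sigmoid(c) * (1.0 - clamp(d))
               for c, d in zip(contour_logits, dt_values))


# Strong contour responses inside the object are not penalized ...
print(penalty_energy([10.0, 10.0], [1.0, 3.0]))
# ... but the same response in the background is.
print(penalty_energy([10.0], [0.0]))
```

Minimizing this quantity discourages contour responses wherever the DT branch considers the voxel to be background, which is how the two auxiliary outputs regularize each other.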
Shape-aware Attention
Finally, the attention mechanism was employed in our CDA-Net. The final shape-aware attention module exploits three different features from the previous layers: 1) the contour probabilistic map from the CTN, 2) the DT feature map from the DTTN, and 3) the features from the backbone network. The final attention module can be viewed as a contour- and distance-transform-guided shape-aware attention module, which exploits both the shape and contour features. Fig. 6 illustrates the details of our shape-aware attention block. The input features (i.e., features from the backbone network) are concatenated with the contour and DT features. Subsequently, the attention map is generated by employing a series of convolutional layers and a sigmoid function. The attention map is finally multiplied element-wise (i.e., $\otimes$) with the input feature. The output feature responses are more representative in the object area, primarily owing to the mixed features (i.e., contour and DT). The DT feature guides not only the DTTN but also the attention map, which significantly helps the attention map to be more representative inside the organs. We can represent the output of the shape-aware attention as:

$$F_{out} = A \otimes F_{in}, \tag{5}$$

where $A$ is the attention map,

$$A = \sigma\left(\mathrm{Conv}\left([F_{in}; F_C; F_D]\right)\right), \tag{6}$$

where $\sigma$ denotes the nonlinear sigmoid activation function, $\mathrm{Conv}$ is the convolutional layer, and $[\,\cdot\,;\,\cdot\,]$ indicates the concatenation of the input features $F_{in}$ with the contour and DT features $F_C$ and $F_D$.
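To make the gating concrete, the following plain-Python sketch applies the same idea to single-channel features stored as flat lists. It is a simplification under stated assumptions: the series of convolutional layers is replaced by one hypothetical 1×1 convolution (a weighted sum of the three concatenated values plus a bias), and the weights are illustrative constants rather than learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shape_aware_attention(features, contour, dt,
                          weights=(1.0, 1.0, 1.0), bias=0.0):
    """Gate backbone features with an attention map built from three inputs.

    features, contour, dt: flat lists holding one value per voxel.
    weights, bias: stand-in for a 1x1 convolution over the concatenated
    [feature, contour, DT] channels (hypothetical values).
    """
    gated = []
    for f, c, d in zip(features, contour, dt):
        attention = sigmoid(weights[0] * f + weights[1] * c
                            + weights[2] * d + bias)
        gated.append(attention * f)  # element-wise multiplication
    return gated


# Two voxels with identical backbone responses: one inside the organ
# (DT = 1) and one in the background (DT = 0).
out = shape_aware_attention([1.0, 1.0], [0.0, 0.0], [1.0, 0.0],
                            weights=(0.0, 0.0, 4.0), bias=-2.0)
print(out)
```

In this toy setting the background voxel is attenuated more strongly than the foreground voxel, mirroring the role of the DT feature in suppressing false-positive responses.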
III-B Overall Loss Function
For cardiac segmentation, as mentioned above, the contour map and DT are used to formulate the loss function to predict the precise results from the model. The details of the loss function are described subsequently.
First, the data imbalance problem between the foreground and background voxels should be addressed in the segmentation. Therefore, we used Dice loss [milletari2016vnet] to formulate our loss function. Specifically, to address the multi-class segmentation problem, the generalized DSC (GD) loss [Sudre_2017] is employed. Given the reference image with voxel values $r_n$, and the predicted probabilistic map with elements $p_n$, GD loss takes the following form:

$$\mathcal{L}_{GD} = 1 - 2\,\frac{\sum_{l=1}^{L} w_l \sum_{n} r_{ln}\, p_{ln}}{\sum_{l=1}^{L} w_l \sum_{n} \left(r_{ln} + p_{ln}\right)}, \tag{7}$$

where $L$ indicates the number of classes to be segmented, and $w_l$ is used to assign weight to each label. $w_l$ is usually determined by the number of class voxels, denoted by $w_l = 1/\left(\sum_{n} r_{ln}\right)^2$. The GD loss is computed between the final probabilistic map outputs of our model and the ground-truth labels, denoted as $\mathcal{L}_{GD}$.
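A small plain-Python sketch of the generalized Dice loss may help clarify how the class weights act. It follows the inverse-squared-volume weighting of the cited GD loss; the function name, the list-of-lists layout (one list of voxels per class), and the `eps` stabilizer are assumptions of this sketch.

```python
def generalized_dice_loss(ref, pred, eps=1e-7):
    """Generalized Dice loss for multi-class segmentation.

    ref[l][n]: ground-truth indicator (0 or 1) of class l at voxel n.
    pred[l][n]: predicted probability of class l at voxel n.
    Each class is weighted by the inverse of its squared voxel count,
    so small structures contribute as much as large ones.
    """
    numerator = 0.0
    denominator = 0.0
    for r_l, p_l in zip(ref, pred):
        w_l = 1.0 / (sum(r_l) ** 2 + eps)
        numerator += w_l * sum(r * p for r, p in zip(r_l, p_l))
        denominator += w_l * sum(r + p for r, p in zip(r_l, p_l))
    return 1.0 - 2.0 * numerator / (denominator + eps)


reference = [[1, 1, 0, 0], [0, 0, 1, 1]]
perfect = generalized_dice_loss(reference, reference)
inverted = generalized_dice_loss(reference, [[0, 0, 1, 1], [1, 1, 0, 0]])
print(perfect, inverted)
```

A perfect prediction yields a loss close to zero, whereas a prediction with no overlap yields a loss of one.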
Subsequently, to regress the contour probability map and DT feature, BCE loss (2) and MSE loss (3) are applied for the contour loss $\mathcal{L}_{C}$ and the DT loss $\mathcal{L}_{DT}$, respectively. The penalty energy term is also combined with our loss function.
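Finally, we formulated the loss function to be minimized for CDA-Net. We added all the above terms with the weighting coefficients $\lambda_i$. The final loss function is as follows:

$$\mathcal{L} = \lambda_1 \mathcal{L}_{GD} + \lambda_2 \mathcal{L}_{C} + \lambda_3 \mathcal{L}_{DT} + \lambda_4 E_{P}. \tag{8}$$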
In this equation, is the weight of each term. We used and in the experiments.
III-C Learning the Network
To train the CTN and DTTN, the contour images and DTs of the ground-truth labels were precomputed for each substructure. All the subvolumes were concatenated along the channel dimension to form an $N \times D \times H \times W$ structure, where N is the number of substructures, and D, H, and W are the depth, height, and width, respectively. We used contour images to represent the boundary details, and the foreground DT (FDT) to utilize shape information. To acquire the ground-truth contour image, the Prewitt filter was applied along the three spatial directions to generate a gradient image. The FDT was computed using a linear time algorithm [dt_algo].
To overcome the limitation of GPU memory, we resized the image to voxels for the model input. The input images were pre-processed with fixed windowing values ranging between -300 and 1000. We normalized the image voxel values to [0,1]. To generalize the model, Gaussian noise, rotation, and cutout [Shorten2019] were randomly used for data augmentation.
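The intensity preprocessing can be sketched as a clip followed by a linear rescaling. This minimal example assumes that the windowing bounds of -300 and 1000 are CT intensity values and that the normalization is a min-max mapping of the window onto [0, 1]; the function name is hypothetical.

```python
def window_and_normalize(values, lo=-300.0, hi=1000.0):
    """Clip intensities to the window [lo, hi] and rescale them to [0, 1]."""
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in values]


print(window_and_normalize([-1000, -300, 350, 1000, 2000]))
```

Intensities outside the window saturate at 0 or 1, so very dark or very bright voxels no longer compress the dynamic range of the tissue of interest.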
We trained our model to minimize the loss (8). While training the network, we used the Adam optimizer [kingma2017adam] with a learning rate of 1e-3. The segmentation outputs were obtained by applying the softmax function to the final feature maps. All training experiments were conducted on a machine with a 10-core Intel i9-7900X processor, a 24GB Nvidia Titan RTX GPU, and 128GB of memory. We implemented the proposed network using the PyTorch framework [pytorch2019].
Table I: Dice similarity coefficient and Jaccard index scores of the contour- and distance-transform-guided attention network and the state-of-the-art methods on the Multi-Modality Whole Heart Segmentation 2017 (MM-WHS 2017) computed tomography image dataset.
IV Experimental Results
We trained our model with the Multi-Modality Whole Heart Segmentation 2017 (MM-WHS 2017) dataset (available at http://www.sdspeople.fudan.edu.cn/zhuangxiahai/0/mmwhs/) [ZHUANG201677, Zhuang1900], which provides 60 CT and 60 MRI cardiac images. From each set of 60 images, 20 images with label-annotated data were included for training and 40 images were used for testing. We used only the CT dataset for validation. The Whole Heart Segmentation (WHS) ground-truth data were manually labeled by well-trained students majoring in biomedical engineering or medical physics [ZHUANG201677]. Seven substructures were selected as areas of interest in the WHS study, including:
1) LV: the left ventricular cavity
2) RV: the right ventricular cavity
3) LA: the left atrial cavity
4) RA: the right atrial cavity
5) LV-myo: the myocardium of left ventricle
6) AA: the ascending aorta trunk
7) PA: the pulmonary artery trunk
The training dataset was split into a training/validation set to generalize the model using n-fold cross-validation.
IV-B Evaluation Metrics
The segmentation results were evaluated using the DSC and Jaccard index (JI). Given the binary labeled masks X and Y, we define the DSC and JI as follows:

$$DSC(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}, \qquad JI(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.$$
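As a worked example of the two overlap metrics, the following plain-Python sketch computes both scores for flattened binary masks. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this sketch.

```python
def dice_and_jaccard(x, y):
    """Return (DSC, JI) for two flattened binary masks of equal length."""
    intersection = sum(1 for a, b in zip(x, y) if a and b)
    size_x = sum(1 for a in x if a)
    size_y = sum(1 for b in y if b)
    union = size_x + size_y - intersection
    dsc = 2.0 * intersection / (size_x + size_y) if (size_x + size_y) else 1.0
    ji = intersection / union if union else 1.0
    return dsc, ji


dsc, ji = dice_and_jaccard([1, 1, 1, 0], [0, 1, 1, 1])
print(dsc, ji)
```

Here the masks share two of their three foreground voxels, so the DSC is 2/3 and the JI is 1/2, consistent with the identity DSC = 2·JI / (1 + JI).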
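We also evaluated the surface distance metrics, i.e., 95% Hausdorff distance (HD) and average symmetric surface distance (ASSD), to demonstrate the performance of the proposed shape-aware attention module. The 95% HD was computed after excluding the 5% most outlying voxels, which makes it more robust to noisy outliers. The HD is defined as follows:

$$HD(X, Y) = \max\left\{ \max_{x \in S(X)} d\left(x, S(Y)\right),\; \max_{y \in S(Y)} d\left(y, S(X)\right) \right\},$$

where $d(v, S(X))$ is the shortest distance from an arbitrary voxel $v$ to a set of surfaces $S(X)$.
We can also define the distance function between two sets as

$$D\left(S(X), S(Y)\right) = \sum_{x \in S(X)} d\left(x, S(Y)\right).$$

Then, the ASSD can be defined as follows:

$$ASSD(X, Y) = \frac{D\left(S(X), S(Y)\right) + D\left(S(Y), S(X)\right)}{|S(X)| + |S(Y)|}.$$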
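The surface distance metrics can be illustrated with a brute-force sketch operating directly on surface point sets given as coordinate tuples. The function names are hypothetical, surface extraction from a mask is omitted, and the percentile is computed with a nearest-rank rule on each directed distance list, which is only one of several conventions for the 95% HD.

```python
import math


def surface_distances(surf_a, surf_b):
    """Distance from each point of surf_a to its nearest point in surf_b."""
    return [min(math.dist(a, b) for b in surf_b) for a in surf_a]


def directed_percentile(distances, percentile):
    """Nearest-rank percentile; percentile=100 returns the maximum."""
    ordered = sorted(distances)
    rank = max(1, math.ceil(percentile / 100.0 * len(ordered)))
    return ordered[rank - 1]


def hausdorff(surf_a, surf_b, percentile=100.0):
    """Symmetric (percentile) Hausdorff distance between two surfaces."""
    return max(
        directed_percentile(surface_distances(surf_a, surf_b), percentile),
        directed_percentile(surface_distances(surf_b, surf_a), percentile),
    )


def assd(surf_a, surf_b):
    """Average symmetric surface distance between two surfaces."""
    d_ab = surface_distances(surf_a, surf_b)
    d_ba = surface_distances(surf_b, surf_a)
    return (sum(d_ab) + sum(d_ba)) / (len(d_ab) + len(d_ba))


a = [(0, 0), (1, 0)]
b = [(0, 1), (1, 3)]
print(hausdorff(a, b), assd(a, b))
```

In this example the single far-away point (1, 3) determines the HD, while the ASSD averages it with the smaller distances, which is why the two metrics are reported together.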
To quantify the reduction of false-positive responses, sensitivity and precision were calculated as follows:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

where TP, FP, and FN are the numbers of true-positive, false-positive, and false-negative output voxels, respectively.
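A short plain-Python sketch shows how the voxel counts translate into the two scores. The function name and the convention of returning 0.0 for an empty denominator are assumptions of this sketch.

```python
def sensitivity_and_precision(pred, truth):
    """Return (sensitivity, precision) for flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


# An over-segmented prediction: every true voxel is found,
# but half of the predicted voxels are false positives.
print(sensitivity_and_precision([1, 1, 1, 1, 0], [1, 1, 0, 0, 0]))
```

Over-segmentation leaves the sensitivity at its maximum while lowering the precision, so a gain in precision is the score that reflects fewer false-positive voxels.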
To assess the cardiac segmentation performance of our model, we compared our proposed CDA-Net with other state-of-the-art networks, i.e., U-Net [3dunet2016], Voxelwise Residual Network (VoxResNet) [chen2016voxresnet], DenseVNet [Gibson2018], and Attention Gate U-Net (AGU-Net) [oktay2018attention].
Quantitative Results: Table I lists the quantitative results of cardiac segmentation. The DSC and JI scores were computed for all cardiac substructures, including LV, RV, LA, RA, LV-myo, AA, PA, and WH. Our proposed network outperformed other state-of-the-art networks in terms of the DSC and JI. In particular, CDA-Net achieved high scores in identifying uneven boundary structures (i.e., LV-myo, AA, and PA) by finding shape-aware features. However, the shape-prior features and self-attention mechanism showed inferior performance in cardiac multiorgan segmentation. DenseVNet, which is trained by shape-prior features using an additional tensor, did not succeed in accurately segmenting cardiac organs because the resolution of its shape-prior features was insufficient to learn the details of object shapes. AGU-Net also failed to produce fine segmentation. The primary reason for the failure is that the self-attention mechanism worked to mine common features; consequently, the details of the object (i.e., boundary areas) were easily neglected. The box plots for the paired t-test are illustrated in Fig. 7.
Qualitative Results: Figs. 8 and 9 show the visualization of the cardiac segmentation results. Fig. 8 shows axial slices of the segmentation results, and Fig. 9 illustrates the volume and surface of the predicted labels. Notably, U-Net and VoxResNet, which have no shape features, exhibited false-positive responses outside of the heart shapes. Conversely, the networks that learned the heart shapes produced less noisy and smoother surface results. Although DenseVNet and AGU-Net reduced the false-positive responses, they still had difficulty in determining the boundaries among the substructures. When compared to other results, our proposed network detected precise boundaries without any false-positive responses. The results indicate that the shape-aware attention module successfully suppressed the mis-segmented outputs and concentrated on the boundary area of the target object.
IV-D Ablation Study
We conducted ablation studies to evaluate the proposed method. Table II lists the comparison results of the ablations. base indicates a U-Net backbone network with an appended V-transition module. In base+CBAM, the shape-aware attention module is replaced with a CBAM [cbam]. In addition, ablation experiments were conducted by excluding certain modules of CDA-Net. base+CTN applies the attention mechanism with only the contour probability map. Conversely, base+DTTN is composed of the backbone and DTTN for the DT feature. base+CTN+DTTN and base+CTN+DTTN+penalty were compared to verify the effectiveness of the proposed penalty function.
As listed in Table II, our model outperforms the other ablations. base+CTN and base+DTTN showed minimal improvement in the DSC score when compared to the base network. However, base+CTN+DTTN showed a significant improvement when compared to the base network. In particular, our model with the contour probability map and DT feature achieved high scores in 95% HD, ASSD, and precision, indicating that our proposed shape-aware attention module with CTN and DTTN significantly reduced false-positive responses.
Figure 11 shows the attention map of our network and its ablations. The traditional self-attention mechanism CBAM shows sufficient performance while identifying the blob of the target object but reveals weaknesses in detecting the precise boundary of the object. Self-attention with only the contour probability map or DT feature weakens the shape-prior information and the details of the boundary contexts. When we employed both auxiliary modules, the attention map showed a stronger response in the boundary area. Moreover, our proposed penalty energy enhanced the contour responses and preserved the shape-prior information. Accordingly, the contour probability map and DT mutually affected the construction of a complementary feature that forms a fine attention map that has strong responses on the boundary area. In addition, the proposed penalty energy could strengthen the attention map.
V Discussion and Conclusion
Cardiac segmentation is a challenging task because of the ambiguous multiorgan boundaries. In this study, we focused on resolving ambiguous boundaries to improve cardiac segmentation. The proposed network applies the attention mechanism to utilize the details of boundaries between substructures. Our proposed method attempted to obtain shape-prior features using the attention mechanism because the shape features of each organ mutually complement the accuracy of cardiac segmentation. However, the traditional shape-prior methods are inferior for the following reasons. DenseVNet [Gibson2018] attached the volume tensor for shape-prior features but showed inferior performance in delineating the boundaries of cardiac organs. Self-attention mechanisms, including CBAM [cbam] and AGU-Net [oktay2018attention], were also proposed to obtain precise segmentation, but experienced difficulty in determining the exact contour features, even though the self-attention mechanisms are suitable for identifying the blob of an object. Moreover, the shape-prior methods are data-driven algorithms, which indicates that they require various training data distributions for generalized performance. In summary, the preceding networks typically failed to capture the details of the targeted objects, which indicates that it is difficult to achieve high performance on the adjacent multi-organ segmentation task. To complement the typical shape-prior methods, our shape-aware contour attention mechanism attached the contour and DT features to the attention module. We entangled the boundary-aware and shape-prior features using CTN and DTTN, supervised by the contour and DT of the organs, respectively. The DT feature helped the model to easily focus on the blob of the target object, and the contour feature drove the attention map to obtain a high response in the boundary regions. Moreover, penalty energy was applied to enhance the accuracy of the DT and contour features in the complementary relations. Therefore, the proposed shape-aware contour attention method successfully produced a fine attention map that focused on the boundary of the object while maintaining shape-prior information.
Additionally, our proposed network significantly reduced the false-positive responses. Conventional deep learning networks [ronneberger2015unet, 3dunet2016, chen2016voxresnet, milletari2016vnet] are based on classifying each voxel, which is the foundation of future research; however, they failed to reduce the outliers in the background regions. Moreover, when compared to U-Net [ronneberger2015unet] and VoxResNet [chen2016voxresnet], CDA-Net showed superior performance in reducing outliers and refining the segmentation results, especially in the boundary area. This is because our proposed CDA-Net focused on generating a fine attention map that has a low probability in the background area. Therefore, it can lead to superior performance on multi-organ segmentation problems that require the identification of the exact boundary between organs.
In conclusion, the shape features, including the contour and DT of the target objects, minimized the false-positive errors in the segmentation problem. Prior research synthesized shape-prior information and deep features to clarify the extrapolation of the shape features in medical image segmentation. They succeeded in improving segmentation accuracy; however, they failed to detect clear boundaries and the surface of the target organs. From this aspect, a shape-aware contour-guided attention was proposed in this study to refine multiorgan segmentation based on shape features. The proposed shape-aware contour attention method performs an important role in determining the precise boundary, whereas the shape-prior attention only helps in regressing the shape blob approximation. In addition, the proposed method improved the generalization performance when compared to the other data-driven shape-prior-based methods. The contour and DT guided the network to be trained such that the target organs were better delineated, primarily because a precise attention map was obtained in the contour area. Based on the experimental results, the proposed method, which synthesized the contour and DT, led to improvements in terms of robustness and accuracy in medical image segmentation.