Magnetic Resonance Imaging (MRI) is widely used by cardiologists as the gold-standard modality for cardiac assessment. The segmentation of kinetic MR images along the short axis is complicated but essential to precise morphological and pathological analysis, diagnosis, and surgical planning. In particular, one has to delineate the left ventricular endocardium (LV), myocardium (MYO), and right ventricular endocardium (RV) to calculate the volume of the cavities in cardiac MRI video, including the end-diastolic (ED) and end-systolic (ES) phases.
Prior works proposed deep learning based approaches to learn robust contextual and semantic features, achieving state-of-the-art segmentation performance. However, automatic and accurate 3D cardiac MRI video segmentation remains very challenging due to significant variations across subjects, ambiguous borders, inhomogeneous intensity, and artifacts, especially for the RV. Two key issues need to be resolved: (1) In cardiac MRI video, the RV segmentation performance is influenced by complicated shape and inhomogeneous intensity, especially the partial volume effect close to the free wall. (2) Subtle structures (e.g., MYO) have ambiguous borders and different orientations in different anatomical planes of MRI video, causing segmentation inaccuracy. Thus, precise and robust cardiac MRI video segmentation is highly desirable.
In this paper, we propose a new Deformable U-Net (DeU-Net) to address the aforementioned issues by fully exploiting the spatio-temporal information from 3D cardiac MRI video and aggregating temporal information to boost segmentation performance. The DeU-Net consists of two parts: a Temporal Deformable Aggregation Module (TDAM) and a Deformable Global Position Attention (DGPA) network. To address the partial volume effect of RV in [13, 14], the TDAM utilizes the spatio-temporal information of the MRI video clip to produce fused feature maps by a temporal aggregation deformable convolution. To handle the issue of subtle structures, the U-Net based DGPA network jointly encodes a wider range of multi-dimensional contextual information into global and local features, yielding clear and continuous borders in every segmentation map. The experimental results show, quantitatively and qualitatively, that our proposal achieves state-of-the-art performance on commonly used metrics, especially those that capture cardiac marginal information (ASSD and HD).
The architecture of DeU-Net is shown in Fig. 1. It includes a Temporal Deformable Aggregation Module (TDAM) and a Deformable Global Position Attention (DGPA) network. The proposed TDAM consists of two parts: a temporal deformable convolution and an offset prediction network based on U-Net to predict deformable offsets. The fused features produced by TDAM are fed into DGPA for the final segmentation results. The DGPA network, which also employs U-Net as its backbone, introduces deformable convolution in the encoders and utilizes the deformable attention block to augment the spatial sampling locations.
2.1 Temporal Deformable Aggregation Module
Many existing methods design very complicated neural networks to achieve performance gain. However, most approaches ignore the spatio-temporal information of 3D MRI video, and treat each frame as a separate object, thereby causing performance degradation. Moreover, in the process of data sampling, various semantic details of the video clips may get lost due to fast variation of cardiac borders and regular convolution, inevitably distorting video local details and pixel-wise connections between the frames. Thus, we propose a Temporal Deformable Aggregation Module (TDAM) to adaptively extract temporal information (motion field) for image interpretation. The proposed TDAM takes a target frame along with its neighboring reference frames as inputs to jointly predict an offset field. Then, the enhanced contextual information can be fused into the target frame by a temporal aggregation deformable convolution.
We denote the dimension of an input 3D MRI video as , where plane is the short-axis plane, is the short axis, and is the temporal dimension. Considering the large inter-slice gap in MRI cardiac images along axis, we select images within the dimension where stronger correlation may exist. Specifically, we define as the target frame in a 3D MRI video at time , where and are the height and width of the input feature map. In order to leverage temporal information, we take the preceding and succeeding frames as the reference to improve the quality of the target frame . For a 3D cardiac MRI video clip , the conventional temporal fusion scheme (i.e., Early Fusion) can be formulated as a multichannel convolution directly applied on the target frames as:
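One way to write this early-fusion convolution is sketched below. The notation is assumed for illustration: $F$ is the fused feature map, $I_t$ the frame at time $t$, $t_0$ the target time, $r$ the number of reference frames on each side, $S$ the kernel size, $K_{t,s}$ the kernel weight of frame (channel) $t$ at tap $s$, $\mathbf{k}$ a spatial position, and $\mathbf{k}_s$ the regular sampling offset of tap $s$.

```latex
F(\mathbf{k}) = \sum_{t=t_0-r}^{t_0+r} \; \sum_{s=1}^{S^2} K_{t,s} \cdot I_t(\mathbf{k} + \mathbf{k}_s)
```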
where represents the quality-enhanced feature map. is the size of convolution kernel and denotes the kernel for -th channel. represents an arbitrary spatial position and indicates the regular sampling offsets. However, noisy content can be easily introduced due to cardiac temporal change in video. Inspired by Multi-Frame Quality Enhancement  for Video Quality Enhancement (VQE), we design TDAM to augment the regular sampling offset with extra learnable offset for the potential spatio-temporal correlations:
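A sketch of the resulting temporal deformable convolution, again with assumed notation: $F$ is the fused feature map, $I_t$ the frame at time $t$, $t_0$ the target time, $r$ the number of reference frames on each side, $S$ the kernel size, $K_{t,s}$ the kernel weight of frame $t$ at tap $s$, $\mathbf{k}$ a spatial position, $\mathbf{k}_s$ the regular sampling offset, and $\boldsymbol{\delta}_{(t,\mathbf{k}),s}$ the learnable offset for frame $t$, position $\mathbf{k}$, and tap $s$. Setting every $\boldsymbol{\delta}$ to zero recovers the regular multichannel convolution.

```latex
F(\mathbf{k}) = \sum_{t=t_0-r}^{t_0+r} \; \sum_{s=1}^{S^2} K_{t,s} \cdot I_t\left(\mathbf{k} + \mathbf{k}_s + \boldsymbol{\delta}_{(t,\mathbf{k}),s}\right)
```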
Note that the deformable offset is designed for each convolution window centered at a spatio-temporal position , as shown in Fig. 1. We propose to take the whole clip into consideration and jointly predict all the deformable offsets with a U-Net based network. Max-pooling and deconvolutional layers are used for downsampling and upsampling, respectively. Convolutional layers with stride 1 and zero padding are used to retain the feature size. With such a scheme, the spatial deformations with temporal dynamics in 3D MRI video clips can be simultaneously modeled.
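The sampling step of a temporal deformable convolution can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this work: the function names, the single-channel frames, and the data layout are assumptions made for illustration, and the offsets are passed in directly rather than predicted by a U-Net. Fractional sampling positions are resolved by bilinear interpolation, and positions outside the frame read as zero.

```python
import math


def bilinear(img, y, x):
    """Sample img (a list of rows) at a fractional position; outside pixels count as 0."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    dy, dx = y - y0, x - x0
    value = 0.0
    for yy, wy in ((y0, 1.0 - dy), (y0 + 1, dy)):
        for xx, wx in ((x0, 1.0 - dx), (x0 + 1, dx)):
            if 0 <= yy < h and 0 <= xx < w:
                value += wy * wx * img[yy][xx]
    return value


def temporal_deformable_fuse(frames, kernels, offsets, size=3):
    """Fuse a clip of frames into a single feature map (hypothetical helper).

    frames[t][y][x]     : single-channel frames of the clip
    kernels[t][s]       : one weight per frame t and kernel tap s
    offsets[t][y][x][s] : (dy, dx) offset per frame, output position and tap
    """
    h, w = len(frames[0]), len(frames[0][0])
    half = size // 2
    # Regular sampling grid of the kernel, in row-major order.
    taps = [(i, j) for i in range(-half, half + 1) for j in range(-half, half + 1)]
    fused = [[0.0] * w for _ in range(h)]
    for t, frame in enumerate(frames):
        for y in range(h):
            for x in range(w):
                for s, (ky, kx) in enumerate(taps):
                    dy, dx = offsets[t][y][x][s]
                    # Regular grid position plus the per-tap offset.
                    fused[y][x] += kernels[t][s] * bilinear(frame, y + ky + dy, x + kx + dx)
    return fused
```

With all offsets set to zero, the function reduces to the regular multichannel (early-fusion) convolution, so the offsets are the only place where motion between frames can be absorbed.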
Compared with the existing VQE approaches which achieve explicit motion compensation before fusion to alleviate the negative effects of temporal motion, TDAM implicitly focuses on cardiac border cues with position-specific sampling. As shown in Fig. 2, adjacent deformable convolution windows can independently sample the contents, achieving higher flexibility and performance boost.
2.2 Deformable Global Position Attention
Regular convolution is restricted by the kernel size as well as the fixed geometric structures, which results in limited performance in modeling geometric transformations. In practice, it is very difficult to reduce false-positive predictions due to unclear borders between the cardiac instances. To address these issues, we propose a Deformable Global Position Attention (DGPA) network to capture a sufficiently large receptive field and semantic global contextual information. DGPA augments the spatial sampling locations in the modules with additional offsets, which are designed to model complex geometric transformations. Thus, long-range contextual information can be collected, which helps obtain more discriminative cardiac borders for pixel-level prediction.
As shown in Fig. 1, the fused local features are regarded as input to the DGPA block, where represents the number of input channels, and indicate the height and width of the input features, respectively. We first feed the input features to a deformable convolution layer to capture cardiac geometric information. The formulation is as below:
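A standard deformable convolution can be sketched as follows. The notation is assumed for illustration: $X$ is the input feature map, $Y$ the output, $W_s$ the kernel weight at tap $s$, $S$ the kernel size, $\mathbf{p}$ a spatial position, $\mathbf{p}_s$ the regular sampling offset of tap $s$, and $\Delta\mathbf{p}_s$ the learned deformable offset.

```latex
Y(\mathbf{p}) = \sum_{s=1}^{S^2} W_s \cdot X\left(\mathbf{p} + \mathbf{p}_s + \Delta\mathbf{p}_s\right)
```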
where the feature map, is deformable convolution kernel, is the kernel size and is the deformable offset. The input feature map is reshaped to three new feature maps , where denotes the number of pixels (). In order to exploit the high-level features of cardiac borders, a dot-product is conducted between and the transpose of . Then the result is applied into a softmax layer to calculate the attention map:
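In the usual position-attention formulation, this attention map can be sketched as below. The names are assumed for illustration: $B$ and $C$ are two of the reshaped feature maps, $B_i$ and $C_j$ are their feature vectors at pixels $i$ and $j$, $N$ is the number of pixels, and $s_{ji}$ is the resulting attention weight.

```latex
s_{ji} = \frac{\exp\left(B_i \cdot C_j\right)}{\sum_{i=1}^{N} \exp\left(B_i \cdot C_j\right)}
```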
where represents the pixel’s impact on the pixel. More similar feature representations of two pixels indicate a stronger correlation between them. Then we perform a matrix multiplication between the transpose of and , and reshape the result to . Finally, an element-wise sum operation is applied with the feature map from the deformable block to obtain the output features as below:
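A sketch of this output, with assumed names: $s_{ji}$ is the attention weight of pixel $i$ on pixel $j$, $D_i$ the feature vector of the third reshaped map at pixel $i$, $A_j$ the feature from the deformable block at pixel $j$, $N$ the number of pixels, $\alpha$ the scale parameter, and $E_j$ the output feature.

```latex
E_j = \alpha \sum_{i=1}^{N} s_{ji} \, D_i + A_j
```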
is the scale parameter belonging to the position affinity matrix. Each element in is a weighted sum of the features over all positions and selectively aggregates input features . Long-range dependencies of the feature map are calculated to improve intra-class compactness and semantic consistency.
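The attention computation described above can be traced on a toy example. The following pure-Python sketch is an illustration under stated assumptions, not the network code: `b`, `c`, `d`, and `a` are hypothetical names for the three reshaped feature maps and the deformable-block features, each stored as a list of per-pixel feature vectors, and `alpha` stands for the scale parameter.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_attention(b, c, d, a, alpha):
    """Global position attention over N flattened pixels (hypothetical helper).

    b, c : per-pixel vectors used to score pairwise similarity
    d    : per-pixel vectors that are aggregated with the attention weights
    a    : per-pixel features added back by the element-wise sum
    Returns (out, attn), where attn[j][i] weighs pixel i's impact on pixel j.
    """
    n = len(a)
    attn = [
        softmax([sum(p * q for p, q in zip(b[i], c[j])) for i in range(n)])
        for j in range(n)
    ]
    out = []
    for j in range(n):
        channels = len(a[j])
        # Weighted sum of d over all pixels, one value per channel.
        agg = [sum(attn[j][i] * d[i][ch] for i in range(n)) for ch in range(channels)]
        out.append([alpha * g + x for g, x in zip(agg, a[j])])
    return out, attn
```

With `alpha` equal to zero the block returns its input features unchanged, so a scale parameter that starts small lets the network add global context gradually.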
3 Experiments and Results
We evaluate our proposal and competitive approaches on the publicly available data of the ACDC MICCAI 2017 Challenge, with additional labeling done by experienced radiologists. The dataset has right ventricle, myocardium, and left ventricle segmentation frames from MRI videos with labels provided by experienced radiologists. The collected images, which follow the common clinical SSFP cine-MRI sequence, have similar properties to 3D cardiac MRI videos. The MRI sequence, as a series of short-axis slices at the end-diastolic and end-systolic instants, starts from the mitral valves and continues down to the apex of the left ventricle. We resize the exams into 256 × 256 images, and no additional pre-processing is conducted. The dataset has 150 exams from different patients, with 100 for training and 50 for testing.
The proposed method is implemented with the PyTorch library, with reference to the MMDetection toolbox for deformable convolution, using an NVIDIA GTX 1080Ti GPU. For the training set, standard data augmentation (i.e., mirroring, axial flipping, or rotation) is used to better exploit the training samples. We use the Adam optimizer to update the network parameters. The initial learning rate is set to and a weight decay of . We use a batch size of at least . The number of reference frames from Eq. 1 is set as . The training is stopped if the Dice score does not increase for 20 epochs. In our experiments, we perform 5-fold cross-validation.
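The stopping rule can be read as patience-based early stopping. A minimal sketch follows, assuming that "does not increase" means no strict improvement over the best Dice score seen so far; the function name and the per-epoch score list are hypothetical.

```python
def stopping_epoch(dice_per_epoch, patience=20):
    """Return the 0-based epoch at which training stops, or None if it never stops.

    Training stops once the Dice score has failed to exceed its best
    value for `patience` consecutive epochs.
    """
    best = float("-inf")
    stale = 0
    for epoch, dice in enumerate(dice_per_epoch):
        if dice > best:
            best, stale = dice, 0  # new best score resets the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None
```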
We compare the proposed DeU-Net with several state-of-the-art approaches for 3D cardiac MRI video segmentation: (1) kU-Net combined convolutional and recurrent neural networks to exploit the intra-slice and inter-slice contexts, respectively. (2) GridNet incorporated a shape prior whose registration on the input image is learned by the model, learning both high-level and low-level features. (3) Attention U-Net proposed a novel self-attention gating module that learns to suppress irrelevant regions in an input image for dense label predictions. For a fair comparison, all these methods are modified and fully trained for 3D cardiac MRI video segmentation.
For quantitative evaluation, Table 1 details the comparison among U-Net, GridNet, Attention U-Net, kU-Net, and the proposed DeU-Net on average symmetric surface distance (ASSD), Hausdorff distance (HD), and Dice score. In addition to the scores at the ED and ES phases, we evaluate the average score over the entire 3D cardiac MRI video for all three metrics (more detailed statistics are reported in the supplementary materials due to the space limit). We can observe that DeU-Net significantly outperforms all the prior networks on most metrics. It is worth noting that our proposal substantially improves segmentation performance on the RV, where the ventricle has a complicated shape and intensity inhomogeneities. The proposed method achieves the best results on ASSD, which is lower than that of GridNet by an average of mm. Compared to existing methods, HD is lower for our approach by an average of mm, and the Dice score is better by an average of . This is a strong indication that DeU-Net exploits spatio-temporal information and reduces the negative effect of cardiac border ambiguity.
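For reference, the three metrics can be sketched in pure Python on 2D binary masks. This is an illustrative implementation rather than the evaluation code behind Table 1: the function names, the 4-neighbour boundary definition, and the `spacing` parameter (pixel size, e.g. in mm) are assumptions, and both masks are assumed to contain at least one foreground pixel for the distance metrics.

```python
import math


def dice(pred, truth):
    """Dice score of two binary masks given as lists of rows of 0/1."""
    inter = sum(1 for pr, tr in zip(pred, truth) for p, t in zip(pr, tr) if p and t)
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 2.0 * inter / total if total else 1.0


def boundary(mask):
    """Foreground pixels with a 4-neighbour that is background or outside the image."""
    h, w = len(mask), len(mask[0])
    points = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    points.append((y, x))
                    break
    return points


def _nearest(p, points, spacing):
    """Distance from p to the closest point in `points`, scaled by pixel spacing."""
    return min(
        math.hypot((p[0] - q[0]) * spacing[0], (p[1] - q[1]) * spacing[1])
        for q in points
    )


def hausdorff(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between two boundary point sets."""
    return max(
        max(_nearest(p, b, spacing) for p in a),
        max(_nearest(q, a, spacing) for q in b),
    )


def assd(a, b, spacing=(1.0, 1.0)):
    """Average symmetric surface distance between two boundary point sets."""
    dists = [_nearest(p, b, spacing) for p in a] + [_nearest(q, a, spacing) for q in b]
    return sum(dists) / len(dists)
```

HD reports the single worst boundary error while ASSD averages over all boundary points, so the two together indicate both outliers and typical border accuracy.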
To separate the contributions of TDAM and DGPA, we also evaluate the performance of three variants of DeU-Net: (1) DeU-Net(t), the variant without TDAM. (2) DeU-Net(d), the variant without DGPA. (3) ToFlow U-Net, the variant replacing TDAM by Task-oriented Flow (ToFlow). For the differences among RV, MYO and LV, Table 1 shows that DeU-Net(d) works better than ToFlow U-Net and DeU-Net(t) on both HD and ASSD, indicating that TDAM can fully explore temporal correspondences across multiple frames, especially close to the borders. Moreover, DeU-Net outperforms all three variants, validating the necessity of having both TDAM and DGPA in the flow.
Table 1. Column layout: Method, followed by LV, MYO, RV, and combined values for each of ASSD, HD, and Dice.
Finally, Fig. 3 illustrates a visual comparison of the ground truth, GridNet, Attention U-Net, ToFlow U-Net and DeU-Net in a 3D cardiac MRI video clip. It can be seen that GridNet and Attention U-Net produce accurate results on most slices of each 3D volume, but the shape of the target region is not as accurate as with ToFlow U-Net and DeU-Net. Such observations are especially apparent in rows 2, 3, and 4 of Fig. 3. Moreover, DeU-Net accurately extracts the borders of the target regions on most of the slices, as shown in the last row of Fig. 3. Note that the segmentation performance of RV, labeled in blue, is significantly lower than that of MYO and LV due to the irregular shape and the ambiguous borders.
4 Discussions and Conclusions
In this paper, we propose a Deformable U-Net (DeU-Net) to fully exploit spatio-temporal information from 3D cardiac MRI video, including a Temporal Deformable Aggregation Module (TDAM) and a Deformable Global Position Attention (DGPA) network. Based on the temporal correlation across multiple frames, TDAM aggregates temporal information with learnable sampling offsets and captures sufficient semantic context. To obtain discriminative and compact features in subtle structures, the DGPA network encodes a wider range of multi-dimensional fused contextual information into global and local features. Experimental results show that our proposal achieves state-of-the-art performance on commonly used metrics, especially those that capture cardiac marginal information (ASSD and HD). In the future, it would be of interest to apply our proposal to other datasets (such as myocardial contrast echocardiography). Our segmentation method will facilitate the translation of neural networks to clinical practice.
This work was supported in part by National Key Research and Development Program of China [No. 2018YFE0126300], Key Area Research and Development Program of Guangdong Province [No. 2018B030338001], and Information Technology Center, Zhejiang University.
-  Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the problem solved? IEEE transactions on medical imaging 37(11), 2514–2525 (2018)
-  Chen, J., Yang, L., Zhang, Y., Alber, M., Chen, D.Z.: Combining fully convolutional and recurrent neural networks for 3d biomedical image segmentation. In: Advances in neural information processing systems. pp. 3036–3044 (2016)
-  Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
-  Deng, J., Wang, L., Pu, S., Zhuo, C.: Spatio-temporal deformable convolution for compressed video quality enhancement. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 10696–10703 (2020)
-  Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 1725–1732 (2014)
-  Oktay, O., Schlemper, J., Folgoc, L.L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., et al.: Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999 (2018)
-  Peng, P., Lekadir, K., Gooya, A., Shao, L., Petersen, S.E., Frangi, A.F.: A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging. Magnetic Resonance Materials in Physics, Biology and Medicine 29(2), 155–195 (2016)
-  Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
-  Vick III, G.W.: The gold standard for noninvasive imaging in coronary heart disease: magnetic resonance imaging. Current opinion in cardiology 24(6), 567–579 (2009)
-  Wang, T., Xiong, J., Xu, X., Jiang, M., Yuan, H., Huang, M., Zhuang, J., Shi, Y.: Msu-net: Multiscale statistical u-net for real-time 3d cardiac mri video segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 614–622. Springer (2019)
-  Xue, T., Chen, B., Wu, J., Wei, D., Freeman, W.T.: Video enhancement with task-oriented flow. International Journal of Computer Vision 127(8), 1106–1125 (2019)
-  Yang, R., Xu, M., Wang, Z., Li, T.: Multi-frame quality enhancement for compressed video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6664–6673 (2018)
-  Zheng, H., Yang, L., Han, J., Zhang, Y., Liang, P., Zhao, Z., Wang, C., Chen, D.Z.: Hfa-net: 3d cardiovascular image segmentation with asymmetrical pooling and content-aware fusion. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 759–767. Springer (2019)
-  Zotti, C., Luo, Z., Humbert, O., Lalande, A., Jodoin, P.M.: Gridnet with automatic shape prior registration for automatic mri cardiac segmentation. In: International Workshop on Statistical Atlases and Computational Models of the Heart. pp. 73–81. Springer (2017)