1 Introduction
Real-time Magnetic Resonance Imaging (MRI) techniques provide fast and accurate visual guidance in multiple fields. The duration of cardiac surgery (e.g., prosthetic valve implantation in the correct location at the aortic annulus) has been significantly shortened since interactive real-time MRI was introduced [6]. Interventional real-time MRI has also been adopted for congenital, ischemic, and structural heart disease because it can visualize 3D anatomy and assess myocardial tissue as well as local hemodynamics [2]. To achieve real-time MRI guidance, the images need to be segmented on-the-fly, at a speed of at least 30 and preferably up to 100 frames per second (FPS) [9, 3].
However, performing real-time segmentation on cardiac MRI images is a challenging task. In addition to difficulties such as anisotropic resolution, cardiac border ambiguity, and large variations among target objects across patients [11], the requirement of real-time segmentation demands a lightweight and efficient processing framework. Existing approaches used complicated neural network architectures to achieve good accuracy and were not able to make inferences in real time [4, 12]. Recently, the Statistical Convolutional Neural Network (SCNN) was proposed to speed up conventional CNNs with little performance loss in video object detection [10]. Instead of feeding the input samples as deterministic values, SCNN used Independent Component Analysis (ICA) to extract parameterized statistical distributions in canonical form that compactly model the temporally and contextually correlated information. The network then propagated the distributions in canonical form more efficiently than deterministic values.
In this work, we propose Multiscale Statistical UNet (MSUNet) for real-time cardiac MRI segmentation. We incorporate SCNN and a new multiscale data sampling method into the UNet to capture spatiotemporal correlation in the input data. Our model adopts a parallel architecture to efficiently propagate the multiscale distributions. Specifically, we apply ICA to multiple sets of temporal image patches to generate a cluster of canonical form distributions, each of which models the input data at a different scale. This multiscale sampling method preserves the information of spatiotemporal correlations at different scales. We then implement a number of parallel yet lightweight encoder-decoder style branches for efficient inference. Each branch propagates the canonical form distributions of one specific scale. Experimental results show that our MSUNet achieves up to 268% and 237% speedup with 1.6% and 3.6% increased Dice scores compared with vanilla UNet and a modified state-of-the-art method GridNet [12], respectively.
2 Background
SCNN [10] was the first model to feed CNNs with a reasonable number of statistical distributions decomposed from the input data. SCNN is lighter and thus faster than conventional CNNs, which conduct deterministic operations (such as sum and max).
SCNN applied ICA to decompose video frames that exhibit spatiotemporal correlation into canonical form distributions as follows:
$$X = a_0 + \sum_{i=1}^{N} a_i \epsilon_i + a_r \epsilon_r \qquad (1)$$
where (1) $X$ is a random multivariate signal, which in video object detection represents the same pixel across multiple frames in a snippet; (2) $a_0$ is the mean value of $X$; (3) $\epsilon_i$ are the additive independent subcomponents of $X$; (4) $a_i$ are the corresponding weights, which act as the mixing matrix; (5) $\epsilon_r$ denotes uncorrelated Gaussian noise and $a_r$ is the weight of $\epsilon_r$; (6) $N$ is the basis dimension of the canonical form distribution.
With the help of predefined core operations (weighted sum and max) whose outputs remain canonical form distributions, SCNN needs little modification to the standard gradient descent based scheme. It can be trained using the same forward and back propagation procedures as conventional CNNs. At the output, the results are mixed to form a temporal feature map for each sample by plugging in the values of the independent sources from the ICA process. By processing multiple frames at a time through distributions, SCNN significantly speeds up object detection in videos over conventional CNNs with slight accuracy degradation.
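To illustrate why a weighted sum keeps its output in canonical form, the following is a minimal Python sketch rather than SCNN's actual implementation. The class name `CanonicalForm`, the function `weighted_sum`, and the rule of combining the noise weights in quadrature are assumptions made for this example (the rule presumes that the private noise terms of different inputs are mutually independent). The max operation defined in [10] is not reproduced here.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence


@dataclass
class CanonicalForm:
    """Hypothetical container: mean + sum_i weights[i] * source_i + noise * private_noise."""
    mean: float
    weights: List[float]
    noise: float = 0.0

    def evaluate(self, sources: Sequence[float]) -> float:
        """Mix back to a deterministic value by plugging in source values (noise term omitted)."""
        return self.mean + sum(w * s for w, s in zip(self.weights, sources))


def weighted_sum(forms: Sequence[CanonicalForm],
                 coeffs: Sequence[float],
                 bias: float = 0.0) -> CanonicalForm:
    """Weighted sum of canonical forms; the result is again a canonical form."""
    n = len(forms[0].weights)
    mean = bias + sum(c * f.mean for c, f in zip(coeffs, forms))
    weights = [sum(c * f.weights[i] for c, f in zip(coeffs, forms)) for i in range(n)]
    # Assumption: private noise terms are independent, so their weights add in quadrature.
    noise = sqrt(sum((c * f.noise) ** 2 for c, f in zip(coeffs, forms)))
    return CanonicalForm(mean, weights, noise)


a = CanonicalForm(mean=1.0, weights=[0.5, -0.2], noise=0.3)
b = CanonicalForm(mean=2.0, weights=[0.1, 0.4], noise=0.4)
out = weighted_sum([a, b], coeffs=[2.0, -1.0])

# Mixing the propagated form gives the same value as mixing first and summing afterwards.
sources = [1.0, -2.0]
print(out.evaluate(sources), 2.0 * a.evaluate(sources) - b.evaluate(sources))
```

Because the operation only touches the mean and the weights, one pass over the coefficients handles every frame that shares the same independent sources, which is the intuition behind processing multiple frames at a time.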
3 Method
In this section, we first present a multiscale sampling method to extract canonical form distributions from input 3D MRI videos. Then we introduce the architecture of MSUNet and explain how it processes these distributions for real-time segmentation.
3.1 Distribution Extraction with Multiscale Data Sampling
To build linear distributions in parameterized canonical form (Equation 1) via ICA, we need to decide how to extract samples from the 3D MRI video to feed into ICA, i.e., what information each random signal should represent. In the SCNN approach to video object detection [10], the video clips are resized and split into small snippets, and each distribution models the same pixel across multiple frames in the same snippet. However, this cannot be applied to 3D MRI video directly, since many semantic details important to segmentation would be lost. Thus, we propose to let each signal represent a patch within a small range (both spatially and temporally) where strong correlation exists. Specifically, we denote the dimensions of an input 3D MRI video as [x, y, z, t], where the x-y plane is the short-axis plane, z is the short axis, and t is the temporal dimension. The common issues of slice shifting and large inter-slice gaps along the short axis (z axis) in cardiac MRI images [12] lead to minimal spatial correlation in the x-z and y-z planes. Therefore, we extract patches within the dimensions [x, y, t], independent of z.
Before extracting the patches, the 3D MRI videos are first normalized to remove offsets among videos. Each patch is then extracted using a spatial window on the short-axis plane over a fixed number of time steps, which we call the snippet span. We allow different canonical forms to have different window sizes and snippet spans, as such an approach covers potential spatiotemporal correlations at different scales. We call this cluster of distributions with multiple patch sizes multiscale distributions. An example of the extraction process of multiscale distributions with different patch sizes on one slice is shown in Fig. 1(a). The patches are collected at the same position over time and fed to ICA to extract canonical form distributions. ICA has to be used because the propagation of the canonical form distributions requires all the bases to be independent. Other approaches such as PCA cannot guarantee this unless the samples follow Gaussian distributions, which is not the case in our problem. As a result, the snippet of 3D MRI video is "collapsed" into a smaller 3D image, with each voxel representing a canonical form distribution (Equation 1) that captures both spatial and temporal correlations; the resulting compression ratio is determined by the patch size and the predetermined independent basis dimension.
To show the feasibility of the proposed data sampling, we extract the multiscale canonical form distributions using our procedure at various compression ratios by changing the basis dimension. The visual results along with the compression ratios are shown in Fig. 1(b). With a larger ratio (i.e., a smaller basis dimension of the canonical form distribution), the restored video gains more noise and shows vague contours, which would obstruct the segmentation task. With a smaller ratio, the difference between the input video and the restored (mixed) video is negligible. Therefore, we adopt the smaller compression ratio in the following experiments.
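The sampling step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the windows are taken to be square and non-overlapping, a single short-axis slice is handled, and the names `extract_patches` and `multiscale_patches` are hypothetical. The ICA decomposition that would consume these samples is not shown.

```python
from typing import Dict, List, Sequence, Tuple

Frame = List[List[float]]  # one slice at one time step, indexed [row][col]


def extract_patches(frames: Sequence[Frame], k: int, span: int
                    ) -> Dict[Tuple[int, int], List[List[float]]]:
    """Collect, for every k x k window position, the flattened patch at each of `span` time steps."""
    height, width = len(frames[0]), len(frames[0][0])
    patches = {}
    for top in range(0, height - k + 1, k):          # assumed: non-overlapping windows
        for left in range(0, width - k + 1, k):
            patches[(top // k, left // k)] = [
                [frames[t][top + dy][left + dx] for dy in range(k) for dx in range(k)]
                for t in range(span)                  # same position, consecutive frames
            ]
    return patches


def multiscale_patches(frames: Sequence[Frame], scales: Sequence[Tuple[int, int]]):
    """Run the extraction once per (window size, snippet span) pair."""
    return {(k, span): extract_patches(frames, k, span) for k, span in scales}


# Toy slice: 4 frames of 4 x 4 pixels, pixel value = 100 * t + 4 * row + col.
frames = [[[100 * t + 4 * r + c for c in range(4)] for r in range(4)] for t in range(4)]
scales = [(2, 4), (4, 2)]
samples = multiscale_patches(frames, scales)

for (k, span), patches in samples.items():
    print(f"window {k}x{k}, span {span}: {len(patches)} positions, "
          f"{k * k * span} values per position")
```

Each dictionary entry corresponds to one voxel of the collapsed image: all values gathered at that position would be summarized by a single canonical form distribution.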
3.2 Real-time Segmentation with MSUNet
The multiscale canonical form distributions provide a compact data representation for efficient processing. In this subsection, we explore a parallel structure, namely MSUNet, that can further speed up the segmentation.
Fig. 2 illustrates our MSUNet, which consists of multiple DownTubes (DTs), UpTubes (UTs), Center blocks, and a final evaluator (FE). The DTs and UTs act as the encoders and decoders in UNet for feature propagation. Multiple DTs are built for a set of splitting patch sizes, each consisting of multiple blocks with downscaling convolution layers to perform feature downscaling and reuse. The ICA process and the corresponding mixing operations are done before and after the operations in DT, respectively, and the operations in DT are performed on canonical form distributions, similar to the work developed in SCNN [10]. The features in UT are propagated by blocks made of convolutional layers and upscaled by transposed convolutional layers. The features after each upscaling are concatenated with the ones skipped from DT for feature reuse. After the outputs are obtained from the UTs with various patch sizes, all features have the same dimensions; they are then concatenated and forwarded to the final evaluator to generate the final output. The number of blocks in DT/UT varies to accommodate the input dimensions of the 3D images.
4 Experiments
4.1 Experiment Setup
The evaluation task is to segment the right ventricle (RV), myocardium (MYO), and left ventricle (LV) from MRI video clips in real time. We evaluate the proposed MSUNet and competitive baselines on segmenting the RV, MYO, and LV from the frames at the End Diastolic (ED) and End Systolic (ES) instants. These frames were collected from the ACDC MICCAI 2017 challenge dataset [1] with additional labeling done by experienced radiologists. These frames have similar properties to 3D cardiac MRI videos. The dataset has 150 exams from different patients, with 100 for training and 50 for testing. The images were collected following the common clinical SSFP cine-MRI sequence with a series of short-axis slices starting from the mitral valves down to the apex of the left ventricle. We perform 5-fold cross-validation and use the Dice score to evaluate the segmentation accuracy.
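For reference, the Dice score of one structure is twice the overlap between the predicted and the ground-truth masks divided by the sum of their sizes. The sketch below is a minimal pure-Python version; the integer label encoding in `LABELS`, the flat pixel lists, and the convention of returning 1.0 when a structure is absent from both masks are assumptions made for illustration, not details taken from the evaluation code.

```python
from typing import Dict, Sequence

LABELS = {"RV": 1, "MYO": 2, "LV": 3}  # hypothetical encoding; 0 stands for background


def dice_score(pred: Sequence[int], truth: Sequence[int], label: int) -> float:
    """Dice = 2 * |P and G| / (|P| + |G|) for one label over flattened masks."""
    pred_hits = sum(1 for p in pred if p == label)
    truth_hits = sum(1 for t in truth if t == label)
    overlap = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    if pred_hits + truth_hits == 0:
        return 1.0  # assumed convention: structure absent in both masks
    return 2.0 * overlap / (pred_hits + truth_hits)


def dice_per_structure(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Per-structure Dice plus the unweighted mean over the three structures."""
    scores = {name: dice_score(pred, truth, label) for name, label in LABELS.items()}
    scores["Average"] = sum(scores.values()) / len(LABELS)
    return scores


truth = [0, 1, 1, 2, 2, 3, 3, 3]
pred = [0, 1, 2, 2, 2, 3, 3, 0]
scores = dice_per_structure(pred, truth)
print(scores)
```

The "Average" entry mirrors the way the average column of Table 1 relates to its RV, MYO, and LV columns, i.e., as their unweighted mean.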
We implement two versions of MSUNet with specific snippet spans (5 and 10, denoted as T5 and T10, respectively) for evaluation. The ICA processing time is included when we evaluate the inference time of MSUNet. Existing approaches [4, 12] have reported their FPS on the same dataset, and none of them can perform real-time inference (i.e., at least 30 FPS). Therefore, we modify and rebuild these approaches to speed them up so that they can be compared with MSUNet on a relatively fair basis. We implement a set of shallower/slimmer versions (i.e., with fewer layers/fewer channels) of the models. Specifically, we modify the 2D UNet [7] to a shallow version with a depth of 3 and an initial filter of 8 or 16. We denote them as D3+IF8 and D3+IF16, respectively. The 2D UNet with the vanilla configuration (D5+IF64) is also included. We also modify GridNet to shallower versions with a depth of 2 or 3 and an initial filter of 32, denoted as D2+IF32 and D3+IF32, respectively. The vanilla version of GridNet is one of the best models in the ACDC 2017 challenge. All these methods are fully trained after modification.
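One way to measure FPS so that preprocessing such as ICA is counted in the inference time is sketched below. The function `measure_fps` and the stand-in callables are hypothetical, and the assumption that every frame of a snippet counts toward the frame total is ours; the sketch shows the bookkeeping only, not the timing code used in the experiments.

```python
import time
from typing import Any, Callable, Sequence


def measure_fps(snippets: Sequence[Sequence[Any]],
                preprocess: Callable[[Sequence[Any]], Any],
                infer: Callable[[Any], Any]) -> float:
    """Frames per second over all snippets, with preprocessing time included."""
    frames = 0
    start = time.perf_counter()
    for snippet in snippets:
        features = preprocess(snippet)  # e.g. the ICA step, deliberately inside the timed region
        infer(features)                 # one forward pass covers the whole snippet
        frames += len(snippet)
    elapsed = time.perf_counter() - start
    return frames / elapsed


def fake_ica(snippet):
    """Stand-in for ICA preprocessing: sleeps briefly and passes the data through."""
    time.sleep(0.005)
    return snippet


def fake_network(features):
    """Stand-in for the segmentation network."""
    return [0 for _ in features]


snippets = [list(range(5)), list(range(5))]  # two snippets with a span of 5 frames each
fps = measure_fps(snippets, fake_ica, fake_network)
print(f"{fps:.1f} FPS")
```

With this accounting, a longer snippet span amortizes the per-snippet cost over more frames, which is consistent with the higher FPS of the T10 configuration.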
In our experiments, we do not include 3D UNet for comparison due to its excessive memory consumption, unbalanced input dimensions, and slow inference speed [7, 4]. Likewise, we do not include lightweight networks such as ShuffleNet [5] or MobileNet [8], which are designed with a small memory footprint for image classification/object detection on mobile devices rather than for medical segmentation; inference speed, which mainly depends on the network depth, is not their primary concern. We tried ShuffleNet/MobileNet in our experimental settings, and their speeds are only 8.45/11.58 FPS, slower than the networks we report.
We implement MSUNet and 2D UNets using PyTorch. GridNet was implemented using TensorFlow [12]. All experiments run on a machine with 16 cores of Intel Xeon E5-2620 v4 CPU, 256 GB memory, and an NVIDIA GeForce GTX 1080 GPU.
Table 1: FPS and Dice scores of UNet, GridNet, and the proposed MSUNet.
Methods              FPS    Dice score
                            RV           MYO          LV           Average
GridNet (D3+IF32)    15.7   .842±.028    .804±.026    .901±.036    .849±.014
UNet (D5+IF64)       16.1   .865±.036    .761±.039    .911±.026    .846±.025
GridNet (D2+IF32)    18.2   .815±.025    .812±.014    .851±.033    .826±.011
UNet (D3+IF16)       33.2   .564±.071    .738±.045    .767±.026    .690±.036
UNet (D3+IF8)        43.2   .552±.079    .674±.060    .759±.059    .662±.058
MSUNet (T5)          43.2   .855±.026    .836±.022    .897±.017    .862±.011
MSUNet (T10)         70.2   .837±.034    .811±.049    .854±.040    .834±.020
4.2 Results
Table 1 presents the comparison among UNet, GridNet, and the proposed MSUNet on Dice score and FPS. Our MSUNets achieve the fastest processing speed (highest FPS) and the best average Dice score. Compared with the fastest baseline method, UNet (D3+IF8), our MSUNet (T10) runs 1.63× faster and makes a relative improvement of 26% in segmentation accuracy. Compared with the most accurate baseline method (with the highest average Dice score), GridNet (D3+IF32), our MSUNet (T5) achieves slightly higher accuracy and 2.75× faster processing speed. From the table, it is clear that MSUNet is the only method that can segment 3D MRI videos in real time without a large loss of accuracy.
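The ratios quoted above follow directly from the FPS and average Dice columns of Table 1. The short sketch below recomputes them; the helper names are hypothetical, and reading the 268% and 237% figures from the introduction as FPS ratios of MSUNet (T5) over UNet (D5+IF64) and GridNet (D2+IF32) is our interpretation of the numbers rather than an explicit statement in the text.

```python
# (FPS, average Dice) taken from Table 1.
table = {
    "GridNet (D3+IF32)": (15.7, 0.849),
    "UNet (D5+IF64)": (16.1, 0.846),
    "GridNet (D2+IF32)": (18.2, 0.826),
    "UNet (D3+IF16)": (33.2, 0.690),
    "UNet (D3+IF8)": (43.2, 0.662),
    "MSUNet (T5)": (43.2, 0.862),
    "MSUNet (T10)": (70.2, 0.834),
}


def speedup(method: str, baseline: str) -> float:
    """Ratio of frames per second."""
    return table[method][0] / table[baseline][0]


def dice_gain(method: str, baseline: str) -> float:
    """Absolute difference in average Dice score."""
    return table[method][1] - table[baseline][1]


def relative_dice_gain(method: str, baseline: str) -> float:
    """Dice difference relative to the baseline's Dice score."""
    return dice_gain(method, baseline) / table[baseline][1]


comparisons = [
    ("MSUNet (T10)", "UNet (D3+IF8)"),
    ("MSUNet (T5)", "GridNet (D3+IF32)"),
    ("MSUNet (T5)", "UNet (D5+IF64)"),
    ("MSUNet (T5)", "GridNet (D2+IF32)"),
]
for method, baseline in comparisons:
    print(f"{method} vs {baseline}: "
          f"speedup {speedup(method, baseline):.3f}x, "
          f"Dice gain {dice_gain(method, baseline):+.3f} "
          f"({relative_dice_gain(method, baseline):+.1%} relative)")
```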
For MSUNet, a larger video snippet span (T10) yields a faster processing speed with only a slight accuracy degradation (0.028 in average Dice score). However, when UNet is modified into shallow/slim versions such as UNet (D3+IF16) and UNet (D3+IF8) for real-time processing (at least 30 FPS), the accuracy degrades significantly: it drops from 0.846 to 0.690 and 0.662, respectively. We observe the same pattern for GridNet, and conclude that MSUNet maintains a stable accuracy when configured for segmentation in real time.
Finally, Fig. 3 shows examples of MSUNet segmentation results at various time steps. MSUNet accurately segments the target areas, and the boundaries are clearly extracted on most of the slices. In the base and middle slices, the segmentation fits the contours of the targets. In some of the apex slices, the segmentation of RV (labeled in blue) is not as accurate as that of MYO and LV because of the unclear boundaries between the instances.
5 Conclusions
In this paper, we proposed Multiscale Statistical UNet (MSUNet) for real-time 3D cardiac MRI video segmentation. Based on the scheme of the Statistical Convolutional Neural Network, we model the input samples as multiscale canonical form distributions for speedup, while the spatiotemporal correlation is still fully utilized. A parallel statistical UNet is then proposed to process these multiscale distributions efficiently. On the 3D cardiac MRI videos from the ACDC MICCAI 2017 dataset, MSUNet achieves up to 268% and 237% speedup with 1.6% and 3.6% increased Dice scores compared with vanilla UNet and a modified state-of-the-art method GridNet, respectively.
6 Acknowledgement
This work was approved by the Research Ethics Committee of Guangdong General Hospital, Guangdong Academy of Medical Sciences with protocol No. 20140316. This work was supported by the National Key Research and Development Program [2018YFC1002600], Science and Technology Planning Project of Guangdong Province, China [No. 2017A070701013, 2017B090904034, 2017030314109, and 2019B020230003], National Science Foundation grant [CCF1919167], and Guangdong peak project [DFJH201802].
References

[1] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE Transactions on Medical Imaging 37(11), 2514–2525 (2018)
[2] Campbell-Washburn, A.E., Tavallaei, M.A., Pop, M., Grant, E.K., Chubb, H., Rhode, K., Wright, G.A.: Real-time MRI guidance of cardiac interventions. Journal of Magnetic Resonance Imaging 46(4), 935–950 (2017)
[3] Iltis, P.W., Frahm, J., Voit, D., Joseph, A.A., Schoonderwaldt, E., Altenmüller, E.: High-speed real-time magnetic resonance imaging of fast tongue movements in elite horn players. Quantitative Imaging in Medicine and Surgery 5(3), 374 (2015)
[4] Isensee, F., Jaeger, P.F., Full, P.M., Wolf, I., Engelhardt, S., Maier-Hein, K.H.: Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In: International Workshop on Statistical Atlases and Computational Models of the Heart. pp. 120–129. Springer (2017)
[5] Ma, N., Zhang, X., Zheng, H.T., Sun, J.: ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 116–131 (2018)
[6] McVeigh, E.R., Guttman, M.A., Lederman, R.J., Li, M., Kocaturk, O., Hunt, T., Kozlov, S., Horvath, K.A.: Real-time interactive MRI-guided cardiac surgery: Aortic valve replacement using a direct apical approach. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 56(5), 958–964 (2006)
[7] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
[8] Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: Inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4510–4520 (2018)
[9] Schaetz, S., Voit, D., Frahm, J., Uecker, M.: Accelerated computing in magnetic resonance imaging: Real-time imaging using nonlinear inverse reconstruction. Computational and Mathematical Methods in Medicine 2017 (2017)
[10] Wang, T., Xiong, J., Xu, X., Shi, Y.: SCNN: A general distribution based statistical convolutional neural network with application to video object detection. arXiv preprint arXiv:1903.07663 (2019)
[11] Zheng, Q., Delingette, H., Duchateau, N., Ayache, N.: 3D consistent & robust segmentation of cardiac images by deep learning with spatial propagation. IEEE Transactions on Medical Imaging (2018)
[12] Zotti, C., Luo, Z., Lalande, A., Jodoin, P.M.: Convolutional neural network with shape prior applied to cardiac MRI segmentation. IEEE Journal of Biomedical and Health Informatics (2018)