1 Introduction
Real-time cine Magnetic Resonance Imaging (MRI) has enabled fast and accurate visual guidance in various cardiac interventions, such as aortic valve replacement [15], cardiac electroanatomic mapping and ablation [17], electrophysiology for atrial arrhythmias [22], intracardiac catheter navigation [7], and myocardial chemoablation [18]. In these applications, it is strongly desirable to segment the temporal frames on the fly, satisfying both throughput and latency requirements. The throughput should be at least the cine MRI reconstruction rate of 22 frames per second (FPS) [19, 13]. The latency should be no more than 50 ms to avoid visually noticeable lags [2]. Most existing segmentation methods [14, 29, 28, 23, 27, 26], however, focus on accuracy. To handle cardiac border ambiguity and large variations among target objects from different patients, these methods incur high computation cost. Hence their inference latency and throughput are far from meeting the real-time requirements, and they can only be applied offline.
MSUNet [24] was proposed in MICCAI’19 as the first framework achieving real-time segmentation of 3D cardiac cine MRI. It uses a canonical form distribution to describe the multiple input frames in a snippet of cine MRI so that only a single pass through the network is needed for all the frames in the snippet. While MSUNet increases the throughput drastically, the inference latency also increases to well above 50 ms due to the need for input clustering, i.e., the inference is carried out only after all the frames in a snippet have arrived. When MSUNet is applied to real-time cine MRI segmentation, such significant visual lags jeopardize the effectiveness of visual guidance in cardiac intervention.
As a popular computational method for decomposing a multivariate signal into additive independent non-Gaussian signals (bases), Independent Component Analysis (ICA) has been widely used in multiple image processing applications such as noise reduction [11], image separation in medical data [5, 3] and image decomposition [20]. Through the unmixing process in ICA, any image patch out of a given image can be represented by a linear combination of a set of independent bases of the same size as the image patch. In the mixing process, the original image can be reconstructed using the bases with proper coefficients.
In this paper, based on a new interpretation of ICA for learning (Section 2), we propose ICAUNet, a novel model that not only achieves highly accurate 3D cardiac cine MRI segmentation, but also attains both high throughput and low latency. Specifically, an input temporal frame in the cine MRI is decomposed into independent bases and a mixing tensor, composed of the coefficient tensors of all the bases, by a lightweight ICAencoder. Such an ICAencoder mimics the unmixing operation in ICA. A UNet-like architecture is trained to learn the transform of the mixing tensor from its original function of image reconstruction to the target function of image segmentation. The transformed mixing tensors can then be mixed with the bases through lightweight ICAdecoders to obtain the desired features for the final segmentation evaluation. Because the coefficient tensors that compose the mixing tensor are much smaller than the input frame and can be processed in parallel due to the independence between their corresponding bases, significant latency reduction can be achieved.
Experimental results show that, compared with the state-of-the-art real-time cardiac cine MRI segmentation method MSUNet, ICAUNet achieves much higher Dice scores for all cardiac classes with up to 12.6× latency reduction. More specifically, the latency of ICAUNet is below 50 ms while its throughput is still above 22 FPS, which implies that ICAUNet is the first method meeting the real-time performance requirements in terms of both throughput and latency for MRI-guided cardiac intervention with no visually noticeable lags. In fact, the accuracy achieved by ICAUNet is on a par with state-of-the-art methods that focus on accuracy and can only run offline because of their complexity.
2 Motivation
It has long been recognized that Independent Component Analysis (ICA) can be used to extract features (bases) from images [9]. Following a setup similar to [16], we can partition an image into a set of smaller image patches such that each image patch can be represented as a linear combination of independent basis image patches along with their coefficients. Compactly put, for a set of input image patches arranged as a matrix in which each row vector represents one input image patch, the goal of ICA is to estimate the unmixing matrix such that the realizations of the bases are as mutually independent as possible (the unmixing process), while the reconstruction of the input image patches is as close to the original patches as possible (the mixing process). The matrix used in the reconstruction is called the mixing matrix, and it equals the pseudo-inverse of the unmixing matrix. There are different ICA algorithms and implementations; one popular implementation is FastICA [11]. In the rest of the paper, we extend the matrix terms to tensors, namely the mixing tensor and the basis tensor, due to the multi-dimensional nature of the input images. Each channel in the basis tensor represents a basis, and the number of channels equals the basis dimension of ICA. The corresponding coefficient tensor of each basis can be extracted from the mixing tensor. We will also use basis tensor and bases interchangeably for simplicity of discussion.
We can have an interesting interpretation of ICA: both the mixing tensor and the bases can be considered as some kind of feature representation of the input image, with the bases being more fundamental to the input image and the mixing tensor being more related to a particular application such as reconstruction of the input image. In other words, the bases can be treated as well-behaved image features that can be reused for different applications, while the mixing tensor can be treated as the weights of a simple fully connected layer used to reconstruct the input image.
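The unmixing and mixing processes described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the names `X` (patch matrix, one patch per row), `W` (unmixing matrix), `S` (realizations of the bases), and `A` (mixing matrix) are chosen here for readability, the 2D random image and 4x4 patches are toy inputs, and the estimator is a plain symmetric FastICA with a tanh nonlinearity rather than any specific implementation used in this work.

```python
import numpy as np

def extract_patches(image, patch):
    """Cut a 2D image into non-overlapping patch x patch tiles, one tile per row."""
    h, w = image.shape
    rows = [image[i:i + patch, j:j + patch].ravel()
            for i in range(0, h - patch + 1, patch)
            for j in range(0, w - patch + 1, patch)]
    return np.asarray(rows)

def sym_decorrelate(B):
    """Return (B B^T)^(-1/2) B, which makes the rows of B orthonormal."""
    s, u = np.linalg.eigh(B @ B.T)
    s = np.clip(s, 1e-12, None)
    return (u / np.sqrt(s)) @ u.T @ B

def fast_ica(X, n_components, n_iter=200, seed=0):
    """Symmetric FastICA (tanh nonlinearity). Returns the patch mean,
    the unmixing matrix W (patch_dim x k) and the realizations S = (X - mean) W."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean
    # Whitening: project onto the top principal directions and rescale to unit variance.
    eigval, eigvec = np.linalg.eigh(Xc.T @ Xc / Xc.shape[0])
    top = np.argsort(eigval)[::-1][:n_components]
    K = eigvec[:, top] / np.sqrt(eigval[top])
    Z = Xc @ K
    B = sym_decorrelate(rng.standard_normal((n_components, n_components)))
    for _ in range(n_iter):
        Y = np.tanh(Z @ B.T)
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w for every row w of B.
        B = sym_decorrelate(Y.T @ Z / Z.shape[0]
                            - (1.0 - Y ** 2).mean(axis=0)[:, None] * B)
    W = K @ B.T
    return mean, W, Xc @ W

rng = np.random.default_rng(0)
image = rng.laplace(size=(64, 64))           # toy stand-in for an image
X = extract_patches(image, patch=4)          # 256 patches of dimension 16

# Unmixing with as many bases as patch dimensions: reconstruction is exact.
mean, W, S = fast_ica(X, n_components=16)
A = np.linalg.pinv(W)                        # mixing matrix = pseudo-inverse of W
X_hat = S @ A + mean

# Unmixing with fewer bases: the mixing step can only approximate the input.
mean8, W8, S8 = fast_ica(X, n_components=8)
X_hat8 = S8 @ np.linalg.pinv(W8) + mean8
lossy_error = np.abs(X_hat8 - X).max()
```

With a full set of bases the mixing step recovers the patches, whereas with a reduced basis dimension the reconstruction is only approximate, which is one way to see why the unmixing operation can be lossy.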
With such an insight, we wonder if we can learn to transform the mixing tensor so that it can be utilized for a different set of applications (with the help of the bases) beyond the original image reconstruction. As illustrated in Fig. 1(a), the original mixing tensor for image reconstruction is transformed to that for the target application, which, after mixing with the bases, can be used to obtain the desired target features for the final evaluation of the target application. As the bases are shared, only the mixing tensor, which is composed of the coefficient tensors of all the bases, needs to be transformed. During this process, the coefficient tensors of different bases can be computed in parallel due to the independence between the bases, and each of them is much smaller than the original input. Thus significant latency reduction can be achieved. Since the mixing tensor still exhibits spatial patterns, a conventional image-oriented deep neural network such as UNet can be used as the backbone to learn the transform.
In conventional ICA, the unmixing operation is lossy, which affects the accuracy of the downstream application (such as image reconstruction), and it runs as a separate optimization process, which can be quite time-consuming. Since the learning of the target mixing tensor is also an optimization problem, why not combine the ICA process with the learning process as a joint end-to-end training process, so that we can not only mitigate the impact of the lossy unmixing operation on accuracy, but also eliminate one separate optimization process? This motivation drives us to propose a lightweight neural-network-based ICA encoder and decoder to mimic the unmixing and mixing operations in ICA. We further integrate them with a UNet backbone so that they can be trained end to end.
3 Method
Driven by the motivation in Section 2, a conceptual illustration of our ICAUNet is shown in Fig. 1(b). Its detailed architecture is shown in Fig. 2, where we use a superscript to denote the time stamp of an input frame. A summary of all the notations used in this section is also included in the figure. ICAUNet is mainly made of four types of modules: the ICAencoder, the contracting blocks, the expanding blocks, and the ICAdecoders. Here n is the number of contracting/expanding blocks, a hyperparameter that can affect accuracy and speed, as will be shown in the experiments.
1) ICAencoder: The ICAencoder extracts both the statistically independent basis tensor and the associated initial mixing tensor from the input image. Instead of running a standard ICA process where the image needs to be explicitly partitioned and an explicit iterative optimization is needed to obtain the mixing tensor and the realization of the bases, we propose to use a neural network to obtain them as a function of the input image.
For an input frame size of (,,) for depth, height, and width, respectively, we can choose the basis dimension (in this paper ), the size of the initial mixing tensor as (1,,,,), and the size of independent basis tensor as (1,,,,), where is the output channel width of the transposed convolution between the concatenated mixing tensor and the basis tensor in the ICAdecoder ( in this paper). Each channel in the basis tensor corresponds to an independent basis. The corresponding coefficient tensor of each basis can be obtained from the mixing tensor, each of size (1,1,,,). The extraction of the mixing tensor and the basis tensor shares some layers for low-level feature extraction before they are split channel-wise, in order to reduce the computation.
After both the mixing tensor and the basis tensor are obtained, the mixing tensor is forwarded to the following contracting block, and the basis tensor is directly forwarded to each ICAdecoder block for the mixing operation.
The objective function of ICAencoder, which is used for regulating the optimization towards sparsity, independence and accuracy, can be expressed as
(1) 
where the three loss terms are weighted by coefficients that are all set to 1.0 in our experiments. The first term reflects sparsity through the L1 norm. The second term reflects independence through negentropy [10], which involves a constant (fixed in our experiments) and an element-wise average. The third term is the reconstruction loss. We adopt transposed convolution as the mixing operation, so the L2 distance between the reconstructed frame and the original frame should be minimized.
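To make the three terms concrete, the following NumPy sketch computes one plausible version of such an objective. It is not the exact form of Equation (1): the log-cosh negentropy approximation with constant `a` follows the standard FastICA literature, and applying the L1 term to the mixing tensor and the negentropy term to each basis channel are assumptions made here for illustration, as are all function and parameter names.

```python
import numpy as np

def log_cosh(u, a=1.0):
    """Numerically stable G(u) = (1/a) * log(cosh(a * u))."""
    au = a * np.asarray(u, dtype=float)
    return (np.logaddexp(au, -au) - np.log(2.0)) / a

def gaussian_expectation(a=1.0, deg=64):
    """E[G(v)] for v ~ N(0, 1), computed by Gauss-Hermite quadrature."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(deg)
    return float(np.sum(weights * log_cosh(nodes, a)) / np.sqrt(2.0 * np.pi))

def negentropy(y, a=1.0):
    """Approximate negentropy (E[G(y)] - E[G(v)])^2 of standardised samples y."""
    y = (y - y.mean()) / y.std()
    return float((log_cosh(y, a).mean() - gaussian_expectation(a)) ** 2)

def ica_encoder_loss(mixing, bases, recon, frame,
                     w_sparse=1.0, w_indep=1.0, w_recon=1.0, a=1.0):
    """Weighted sum of sparsity, independence and reconstruction terms.
    Assumptions: L1 on the mixing tensor, negentropy per basis channel."""
    sparsity = np.abs(mixing).mean()
    # Higher negentropy means less Gaussian, so it enters with a minus sign.
    independence = -np.mean([negentropy(b.ravel(), a) for b in bases])
    reconstruction = np.mean((recon - frame) ** 2)
    return w_sparse * sparsity + w_indep * independence + w_recon * reconstruction

rng = np.random.default_rng(0)
gauss = rng.standard_normal(100_000)
laplace = rng.laplace(size=100_000)
j_gauss, j_laplace = negentropy(gauss), negentropy(laplace)

frame = rng.standard_normal((8, 8, 8))
bases = rng.laplace(size=(4, 8, 8, 8))       # toy basis tensor, 4 channels
mixing = np.zeros((4, 2, 2, 2))              # toy mixing tensor
loss = ica_encoder_loss(mixing, bases, frame, frame)
```

In this toy setting the Laplace samples score a clearly higher negentropy than the Gaussian ones, which is the behaviour the independence term relies on.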
2) Contracting blocks: The contracting blocks of ICAUNet are designed to further propagate the mixing tensor and generate the learned ones in a multi-resolution manner. As shown in Fig. 2, the contracting part is made of n contracting blocks. Each contracting block takes the mixing tensor from the preceding stage as input, propagates it through a downsampling module (i.e., convolution with stride 2) and convolution modules (i.e., conv) sequentially, and outputs a learned mixing tensor, which is then forwarded to the next contracting block as well as the corresponding expanding block.
3) Expanding blocks: The expanding blocks are designed to process the features generated by the contracting blocks and forward the outputs to the ICAdecoder blocks and the next expanding block (or a concatenation block). An expanding block has two subtasks: upsampling the mixing tensor (by an upsample block), and calculating dimension-aligned correlation features from the neighbouring frames (by a concatenation block). Note that we use a separate symbol to represent the mixing tensor in the upsampling path, distinguishing it from those in the downsampling path.
We use transposed convolution to achieve the upsampling on , during which we obtain the outputs with various resolutions. Taking , , , and , , from the outputs of contracting blocks for frames and , respectively, we can calculate the temporal correlation features following [12] between mixing tensors and , and between and (). The obtained correlation features explicitly provide matching information from the neighbouring frames for more accurate segmentation. The correlation features are then concatenated with , and forwarded through a convolution module with kernel (convbnleakyrelu) for dimension reduction. These computations are processed in the concatenation block as shown in Fig. 2.
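As a rough illustration of what a correlation feature between two frames looks like, the sketch below implements a simplified 2D correlation in the spirit of [12]: for every spatial position and every displacement within a small window, it averages the channel-wise product of the two feature maps. The function names, the 2D setting, the zero padding, and the window size are assumptions made here; the actual model operates on 3D mixing tensors and its exact configuration is not reproduced.

```python
import numpy as np

def correlation_features(f1, f2, max_disp):
    """f1, f2: (C, H, W) feature maps of two neighbouring frames.
    Returns ((2*max_disp+1)**2, H, W) maps, one per displacement (dy, dx),
    where entry [d, y, x] is the channel-averaged product of f1[:, y, x]
    and f2[:, y+dy, x+dx] (zero outside the image)."""
    _, h, w = f1.shape
    padded = np.pad(f2, ((0, 0), (max_disp, max_disp), (max_disp, max_disp)))
    maps = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = padded[:, max_disp + dy:max_disp + dy + h,
                             max_disp + dx:max_disp + dx + w]
            maps.append((f1 * shifted).mean(axis=0))
    return np.stack(maps)

def displacement_index(dy, dx, max_disp):
    """Position of displacement (dy, dx) along the first output axis."""
    return (dy + max_disp) * (2 * max_disp + 1) + (dx + max_disp)

rng = np.random.default_rng(0)
frame_a = rng.standard_normal((64, 10, 10))
# Second frame: the same content moved down by 1 and right by 2 pixels.
frame_b = np.roll(frame_a, shift=(1, 2), axis=(1, 2))

corr = correlation_features(frame_a, frame_b, max_disp=2)
# Average over interior positions and pick the best-matching displacement.
best = int(corr[:, 2:-2, 2:-2].mean(axis=(1, 2)).argmax())
```

In this toy case the strongest response appears at the displacement that matches the simulated motion, which is the kind of matching information the concatenation block receives.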
4) ICAdecoder: The ICAdecoder block is designed to mimic the mixing operation between the concatenated mixing tensor and the basis, as in standard ICA, to generate the output for evaluation. As discussed earlier, transposed convolution is used as the mixing operation. The transposed convolution acts as both the upsampling for evaluation and the multiplication (projection field) between the basis and each value in the mixing tensor, which helps reduce both the parameter size and the computation load.
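The role of transposed convolution as a mixing operation can be sketched as follows. In this simplified 2D, single-output-channel version, each value of a coefficient map scales the corresponding basis patch and the scaled patch is written into the output at a strided location, so the operation upsamples and mixes at the same time. The function name, shapes, and the use of the bases directly as kernels are assumptions made for illustration only.

```python
import numpy as np

def mix_transposed_conv(coeffs, bases, stride):
    """coeffs: (k, h, w), one coefficient map per basis.
    bases:  (k, p, p), one basis patch per channel, used as the kernel.
    Returns the mixed output of size ((h-1)*stride + p, (w-1)*stride + p)."""
    k, h, w = coeffs.shape
    p = bases.shape[1]
    out = np.zeros(((h - 1) * stride + p, (w - 1) * stride + p))
    for c in range(k):
        for i in range(h):
            for j in range(w):
                # Each coefficient projects its basis patch onto the output.
                out[i * stride:i * stride + p,
                    j * stride:j * stride + p] += coeffs[c, i, j] * bases[c]
    return out

rng = np.random.default_rng(0)
bases = rng.standard_normal((3, 4, 4))       # 3 bases of size 4x4
coeffs = rng.standard_normal((3, 5, 6))      # 5x6 grid of coefficients per basis
mixed = mix_transposed_conv(coeffs, bases, stride=4)

# With stride equal to the patch size the tiles do not overlap, so each tile
# is exactly the linear combination of bases given by its coefficients.
tiles = mixed.reshape(5, 4, 6, 4).transpose(0, 2, 1, 3)
expected_tiles = np.einsum('chw,cpq->hwpq', coeffs, bases)
```

When the stride equals the patch size, this reduces to the patch-wise linear combination of bases used in the ICA mixing process; smaller strides let neighbouring projections overlap and add up.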
After mixing, the mixed features are propagated through a convolution module as the output for evaluation. From to we can obtain a total of multiresolution segmentation outputs for evaluations, denoted as , . After we get from , we forward , , , , and to the final concatenation block and its corresponding ICAdecoder, where we obtain the output with the same size as the original input. Thus, we obtain a total of outputs in multiresolution for evaluation.
Objective function: We evaluate the outputs from the decoder blocks against the multi-resolution ground truth. The overall objective function is
(2) 
where () is the evaluation loss of the multiresolution outputs in the decoder (with Crossentropy); is the corresponding loss weights, while we take for and ; is the loss from Equation (1), and the weight is set to in our experiments. Crossentropy is used for calculating with the rescaled versions of ground truth.
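The structure of such a multi-resolution objective can be sketched as a weighted sum of cross-entropy losses, one per output resolution, plus a weighted ICA-encoder loss. The sketch below is a 2D NumPy illustration only: the weights, the number of outputs, and the nearest-neighbour rescaling of the ground truth by striding are placeholders chosen here, not the values or the rescaling used in the experiments.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean pixel-wise cross-entropy. logits: (C, H, W); labels: (H, W) ints."""
    z = logits - logits.max(axis=0, keepdims=True)
    log_prob = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    picked = log_prob[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return float(-picked.mean())

def rescale_labels(labels, factor):
    """Nearest-neighbour downsampling of a label map (an assumption here)."""
    return labels[::factor, ::factor]

def overall_loss(outputs, labels, weights, ica_loss, ica_weight):
    """Weighted multi-resolution cross-entropy plus the weighted ICA loss."""
    total = ica_weight * ica_loss
    for logits, weight in zip(outputs, weights):
        factor = labels.shape[0] // logits.shape[1]
        total += weight * cross_entropy(logits, rescale_labels(labels, factor))
    return total

rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=(8, 8))               # 4 classes, toy ground truth
outputs = [np.zeros((4, 8, 8)), np.zeros((4, 4, 4))]   # full and half resolution
total = overall_loss(outputs, labels, weights=[1.0, 0.5],
                     ica_loss=2.0, ica_weight=0.1)
```

With all-zero logits every class is equally likely, so each cross-entropy term equals log 4, which makes the weighted sum easy to check by hand.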
3.0.1 Latency reduction analysis
For the inference of a frame, the mixing tensor, as the input to the backbone, is composed of the coefficient tensors, each of which is smaller than the original frame that a regular UNet would take as input. In addition, these coefficient tensors can be handled in parallel (i.e., a new form of task parallelism) due to the independence between the bases. Note that the processing of each coefficient tensor can still utilize any existing parallelization techniques such as model or operator parallelization [8, 6, 4, 21] by applying them to the backbone. Therefore, significant latency reduction can be achieved.
4 Experiments
Experiment Setup: We evaluate our model on an extended ACDC MICCAI 2017 challenge dataset made available by MSUNet [24], with labels on all the frames in the training data. We compare our ICAUNet with MSUNet, the state-of-the-art real-time cine MRI segmentation method on the same dataset [24]. We perform 5-fold cross-validation and use the average Dice score (the higher the better) to evaluate the segmentation accuracy. To further see how the accuracy of ICAUNet compares against state-of-the-art offline segmentation methods that achieve high accuracy but lose real-time performance, we evaluate the test data by submitting the segmentation results of the ED and ES instants to the ACDC online evaluation platform [1].
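For reference, the per-class Dice score used as the accuracy metric can be computed as in the following sketch. The label encoding (1 for RV, 2 for MYO, 3 for LV, 0 for background) is an assumption about the dataset convention, and returning 1.0 when a class is absent from both volumes is a choice made here.

```python
import numpy as np

def dice_score(pred, truth, label):
    """Dice = 2 |P ∩ T| / (|P| + |T|) for one class of a label volume."""
    p = pred == label
    t = truth == label
    denom = p.sum() + t.sum()
    if denom == 0:
        return 1.0  # class absent in both volumes: treated as perfect agreement
    return 2.0 * np.logical_and(p, t).sum() / denom

truth = np.zeros((2, 4, 4), dtype=int)
truth[0, 0, :4] = 1                 # four voxels of class 1
pred = np.zeros_like(truth)
pred[0, 0, 2:4] = 1                 # two voxels overlap with the ground truth
pred[0, 1, 0:2] = 1                 # two voxels do not

classes = {"RV": 1, "MYO": 2, "LV": 3}   # assumed label encoding
scores = {name: dice_score(pred, truth, lab) for name, lab in classes.items()}
average = sum(scores.values()) / len(scores)
```

Here the RV class has two overlapping voxels out of four predicted and four true voxels, giving a Dice score of 0.5.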
All the methods were implemented in PyTorch and trained from scratch with the same hyperparameters and optimizer setting. MSUNets were based on the implementations by [24]. All the networks are fully parallelized using CUDA/CuDNN [4]. All experiments run on a machine with 16 cores of Intel Xeon E5-2620 v4 CPU, 256G memory, and an NVIDIA Tesla P100 GPU.
Table 1. Dice score, throughput (TP), and latency (LT) of MSUNet and ICAUNet on ACDC 3D cardiac cine MRI segmentation.
| Methods | Dice RV | Dice MYO | Dice LV | Dice Average | TP (FPS) | LT (ms) |
|---|---|---|---|---|---|---|
| MSUNet(span=10) | .837±.034 | .811±.049 | .854±.040 | .834±.020 | 70.2 | 442 |
| MSUNet(span=5) | .855±.026 | .836±.022 | .897±.017 | .862±.011 | 43.2 | 249 |
| MSUNet(span=3) | .858±.034 | .838±.039 | .898±.034 | .864±.030 | 29.4 | 169 |
| MSUNet(span=2) | .860±.017 | .837±.031 | .901±.021 | .867±.020 | 21.7 | 125 |
| ICAUNet(n=3) | .900±.023 | .869±.027 | .934±.013 | .901±.017 | 31.6 | 35 |
| ICAUNet(n=4) | .921±.017 | .888±.034 | .952±.015 | .920±.019 | 28.3 | 39 |
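Table 2. Dice score and Hausdorff distance on the ED and ES instants of the ACDC test data.
| Methods | Dice RV | Dice MYO | Dice LV | Hausdorff RV (mm) | Hausdorff MYO (mm) | Hausdorff LV (mm) |
|---|---|---|---|---|---|---|
| Ensemble UNet | 0.923 | 0.911 | 0.950 | 11.13 | 8.69 | 7.15 |
| Ω-Net | 0.920 | 0.891 | 0.954 | N/A | N/A | N/A |
| GridNet | 0.910 | 0.894 | 0.938 | 11.80 | 9.45 | 7.30 |
| ICAUNet(n=4) | 0.920 | 0.890 | 0.940 | 11.91 | 7.93 | 6.85 |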
Performance on real-time segmentation: The results of ACDC 3D cardiac cine MRI segmentation are shown in Table 1. We can see that ICAUNet increases the Dice score by 0.061 (RV), 0.050 (MYO), 0.051 (LV), and 0.053 (average), respectively, compared with the best results achieved by MSUNets. ICAUNets also achieve smaller Dice score variations than MSUNets in most cases. In terms of throughput, although both ICAUNets and MSUNets can satisfy the real-time requirement of 22 FPS, only ICAUNets can meet the real-time latency requirement (below 50 ms), with up to 12.6× lower latency than MSUNets. In summary, ICAUNet not only achieves the best Dice score, but is also the only real-time segmentation method that can simultaneously meet the real-time throughput and latency requirements for visual guidance of cardiac interventions.
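The figures quoted above follow directly from the numbers reported in Table 1, as the short check below shows. The dictionary layout and variable names are chosen here for illustration; the values are the mean Dice scores, throughput, and latency from the table, and the thresholds are the 22 FPS and 50 ms requirements stated in the introduction.

```python
# (RV, MYO, LV, average Dice, throughput in FPS, latency in ms) from Table 1.
table1 = {
    "MSUNet(span=10)": (0.837, 0.811, 0.854, 0.834, 70.2, 442),
    "MSUNet(span=5)":  (0.855, 0.836, 0.897, 0.862, 43.2, 249),
    "MSUNet(span=3)":  (0.858, 0.838, 0.898, 0.864, 29.4, 169),
    "MSUNet(span=2)":  (0.860, 0.837, 0.901, 0.867, 21.7, 125),
    "ICAUNet(n=3)":    (0.900, 0.869, 0.934, 0.901, 31.6, 35),
    "ICAUNet(n=4)":    (0.921, 0.888, 0.952, 0.920, 28.3, 39),
}
msu = [v for k, v in table1.items() if k.startswith("MSUNet")]
ica = [v for k, v in table1.items() if k.startswith("ICAUNet")]

# Dice gain of the best ICAUNet over the best MSUNet, per column.
dice_gain = [round(max(r[i] for r in ica) - max(r[i] for r in msu), 3)
             for i in range(4)]

# Largest latency ratio between any MSUNet and any ICAUNet configuration.
max_speedup = round(max(r[5] for r in msu) / min(r[5] for r in ica), 1)

# Configurations meeting both requirements: at least 22 FPS and at most 50 ms.
meets_both = [k for k, r in table1.items() if r[4] >= 22 and r[5] <= 50]
```

The largest latency ratio corresponds to MSUNet(span=10) at 442 ms against ICAUNet(n=3) at 35 ms.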
From the table, we can also see that the number of convolutional decoder blocks is an effective tuning knob for the trade-off between Dice score and speed. A higher number of blocks results in higher Dice scores at the cost of slightly reduced throughput and increased latency. Visualization of the segmentation results by ICAUNet along with the corresponding ground truth is shown in Fig. 3.
Accuracy vs. state-of-the-art offline methods: To see how the accuracy of ICAUNet compares with state-of-the-art offline segmentation methods that do not satisfy real-time requirements, we further verify our ICAUNet on the ED and ES instants of the ACDC test data. The evaluation results reported by [1], in terms of both Dice score and Hausdorff distance, are shown in Table 2. The results from the best approaches in the literature, including GridNet [29], Ω-Net [23], and ensemble UNet [14], are also included for reference. With complex network structures, the latency and throughput of these methods are far from the real-time requirements, as shown in [25]. In contrast, we see that the accuracy of ICAUNet comes very close to these state-of-the-art results while meeting the real-time throughput and latency requirements.
5 Conclusions
Inspired by ICA, ICAUNet decomposes temporal frames in 3D cardiac cine MRI into independent bases and the corresponding coefficient tensors, which are much smaller in size and help to learn better. Experimental results show that, compared with the state of the art, ICAUNet is the only 3D cardiac cine MRI segmentation method that can satisfy both real-time throughput and latency requirements with comparable (if not better) accuracy.
References
 [1] ACDC challenge, https://www.creatis.insa-lyon.fr/Challenge/acdc/
 [2] Annett, M., Ng, A., Dietz, P., Bischof, W., Gupta, A.: How low should we go? understanding the perception of latency while inking. In: 2014 Graphics Interface. pp. 167–174 (2014)
 [3] Bronstein, A.M., Bronstein, M.M., Zibulevsky, M., Zeevi, Y.Y.: Sparse ICA for blind separation of transmitted and reflected images. International Journal of Imaging Systems and Technology 15(1), 84–91 (2005)
 [4] Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., Shelhamer, E.: cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014)
 [5] Delorme, A., Sejnowski, T., Makeig, S.: Enhanced detection of artifacts in EEG data using higher-order statistics and independent component analysis. Neuroimage 34(4), 1443–1449 (2007)
 [6] Dryden, N., Maruyama, N., Benson, T., Moon, T., Snir, M., Van Essen, B.: Improving strong-scaling of cnn training by exploiting finer-grained parallelism. In: 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp. 210–220. IEEE (2019)
 [7] Gaspar, T., Piorkowski, C., Gutberlet, M., Hindricks, G.: Three-dimensional real-time MRI-guided intracardiac catheter navigation. European heart journal 35(9), 589–589 (2014)
 [8] Gholami, A., Azad, A., Jin, P., Keutzer, K., Buluc, A.: Integrated model, batch, and domain parallelism in training neural networks. In: Proceedings of the 30th on Symposium on Parallelism in Algorithms and Architectures. pp. 77–86 (2018)
 [9] Hoyer, P.O., Hyvärinen, A.: Independent component analysis applied to feature extraction from colour and stereo images. Network: computation in neural systems 11(3), 191–210 (2000)
 [10] Hyvärinen, A., Karhunen, J., Oja, E.: Independent component analysis, vol. 46. John Wiley & Sons (2004)
 [11] Hyvärinen, A., Oja, E.: Independent component analysis: algorithms and applications. Neural networks 13(4-5), 411–430 (2000)
 [12] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: Flownet 2.0: Evolution of optical flow estimation with deep networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2462–2470 (2017)
 [13] Iltis, P.W., Frahm, J., Voit, D., Joseph, A.A., Schoonderwaldt, E., Altenmüller, E.: High-speed real-time magnetic resonance imaging of fast tongue movements in elite horn players. Quantitative imaging in medicine and surgery 5(3), 374 (2015)
 [14] Isensee, F., Jaeger, P.F., Full, P.M., Wolf, I., Engelhardt, S., Maier-Hein, K.H.: Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features. In: International workshop on statistical atlases and computational models of the heart. pp. 120–129. Springer (2017)
 [15] McVeigh, E.R., Guttman, M.A., Lederman, R.J., Li, M., Kocaturk, O., Hunt, T., Kozlov, S., Horvath, K.A.: Real-time interactive MRI-guided cardiac surgery: Aortic valve replacement using a direct apical approach. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine 56(5), 958–964 (2006)
 [16] Olshausen, B.A., Field, D.J.: Natural image statistics and efficient coding. Network: computation in neural systems 7(2), 333–339 (1996)
 [17] Radau, P.E., Pintilie, S., Flor, R., Biswas, L., Oduneye, S.O., Ramanan, V., Anderson, K.A., Wright, G.A.: Vurtigo: visualization platform for real-time, MRI-guided cardiac electroanatomic mapping. In: International Workshop on Statistical Atlases and Computational Models of the Heart. pp. 244–253. Springer (2011)
 [18] Rogers, T., Mahapatra, S., Kim, S., Eckhaus, M.A., Schenke, W.H., Mazal, J.R., Campbell-Washburn, A., Sonmez, M., Faranesh, A.Z., Ratnayaka, K., et al.: Transcatheter myocardial needle chemoablation during real-time magnetic resonance imaging: a new approach to ablation therapy for rhythm disorders. Circulation: Arrhythmia and Electrophysiology 9(4), e003926 (2016)
 [19] Schaetz, S., Voit, D., Frahm, J., Uecker, M.: Accelerated computing in magnetic resonance imaging: Real-time imaging using nonlinear inverse reconstruction. Computational and mathematical methods in medicine 2017 (2017)
 [20] Starck, J.L., Elad, M., Donoho, D.L.: Image decomposition via the combination of sparse representations and a variational approach. IEEE transactions on image processing 14(10), 1570–1582 (2005)
 [21] Vasudevan, A., Anderson, A., Gregg, D.: Parallel multi channel convolution using general matrix multiplication. In: 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP). pp. 19–24. IEEE (2017)
 [22] Vergara, G.R., Vijayakumar, S., Kholmovski, E.G., Blauer, J.J., Guttman, M.A., Gloschat, C., Payne, G., Vij, K., Akoum, N.W., Daccarett, M., et al.: Real-time magnetic resonance imaging–guided radiofrequency atrial ablation and visualization of lesion formation at 3 tesla. Heart Rhythm 8(2), 295–303 (2011)
 [23] Vigneault, D.M., Xie, W., Ho, C.Y., Bluemke, D.A., Noble, J.A.: Ω-net (omega-net): fully automatic, multi-view cardiac mr detection, orientation, and segmentation with deep neural networks. Medical image analysis 48, 95–106 (2018)
 [24] Wang, T., Xiong, J., Xu, X., Jiang, M., Yuan, H., Huang, M., Zhuang, J., Shi, Y.: MSUNet: Multiscale statistical unet for real-time 3d cardiac mri video segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 614–622. Springer (2019)
 [25] Wang, T., Xiong, J., Xu, X., Shi, Y.: SCNN: A general distribution based statistical convolutional neural network with application to video object detection. arXiv preprint arXiv:1903.07663 (2019)
 [26] Xu, X., Lu, Q., Yang, L., Hu, S., Chen, D., Hu, Y., Shi, Y.: Quantization of fully convolutional networks for accurate biomedical image segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 8300–8308 (2018)
 [27] Xu, X., Wang, T., Shi, Y., Yuan, H., Jia, Q., Huang, M., Zhuang, J.: Whole heart and great vessel segmentation in congenital heart disease using deep neural networks and graph matching. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 477–485. Springer (2019)
 [28] Yan, W., Wang, Y., Li, Z., Van Der Geest, R.J., Tao, Q.: Left ventricle segmentation via optical-flow-net from short-axis cine mri: preserving the temporal coherence of cardiac motion. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 613–621. Springer (2018)
 [29] Zotti, C., Luo, Z., Lalande, A., Jodoin, P.M.: Convolutional neural network with shape prior applied to cardiac MRI segmentation. IEEE journal of biomedical and health informatics (2018)