1 Introduction
MRI is the gold standard modality for cardiac assessment [1, 2]. In particular, the use of kinetic MR images along the short-axis orientation of the heart allows accurate evaluation of the function of the left and right ventricles. For these examinations, one has to delineate the left ventricular endocardium (LV), the left ventricular epicardium (or myocardium, MYO) and the right ventricular endocardium (RV) in order to calculate the volume of the cavities in diastole and systole (and thus the ejection fraction), as well as the myocardial mass [3]. These parameters are mandatory to detect and quantify various pathologies.
As of today, the clinical use of cardiovascular MRI is hampered by the amount of data to be processed (often more than 10 short-axis slices and more than 20 phases per slice). Since the manual delineation of all 3D images is clinically impracticable, several semi-automatic methods have been proposed, most of which are based on active contours, dynamic programming, graph cuts or atlas fitting strategies [4, 5, 3, 6, 7]. Unfortunately, these methods are far from real-time due to the manual interaction they require. Also, most of them are ill-suited for segmenting the LV, the RV and the MYO simultaneously.
So far, a limited number of fully-automatic cardiac segmentation methods have been proposed. While some use traditional image analysis techniques like the Hough transform [8] or level sets [9], fully-automatic segmentation methods are usually articulated around a machine learning method [4] and, more recently, deep learning (DL) and convolutional neural networks (CNN) [10]. Among the best CNN segmentation models are those involving a series of convolution and pooling layers followed by one [11] or several [12] deconvolution layers. In 2015, Ronneberger et al. [13] proposed the UNet, a CNN which involves connections between the conv and deconv layers and whose performance on medical images is remarkable. Recently, DL methods have been proposed to segment cardiac images [10, 14, 15]. While Tran [10] applied the well-known fCNN [11] to MR cardiac images, Tan et al. [14] used a CNN to localize (but not segment) the LV, and Ngo et al. [15] used deep belief nets, again to localize but not segment the LV. These methods perform only one- or two-class segmentation and do not incorporate a shape prior inside the network.

In this paper, we propose the first CNN method specifically designed to segment the LV, RV and MYO without a third-party segmentation method. Our approach incorporates a shape prior whose registration on the input image is learned by the model.
2 Our Method
The goal of our method is to segment the LV, the RV and the MYO of a 3D raw MR image X. This is done by predicting a 3D label map of the same size whose voxels contain a label l (Back, LV, RV, or MYO), where "Back" stands for tissues different than the other three. Following the ACDC structure, X is a series of short-axis slices starting from the mitral valve down to the apex [16] (please refer to the ACDC website for more details [17]). In order to enforce a clinically plausible result, a shape prior S is provided which encapsulates the relative position of the LV, RV and MYO. The main challenge when using a shape prior such as this one is to align it correctly onto the input data X [18]. Since the size and the orientation of the heart do not vary much from one patient to another, we register S on X by translating the center of S onto the cardiac center of mass (CoM) of X. The CoM is computed based on the location of the pericardium (obtained from the MYO and RV) in each slice. Since the CoM is not provided with the input image, our method has a regression module designed to predict it. To our knowledge, our approach is the first to incorporate a shape prior as well as its registration within an end-to-end trainable structure.
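As a concrete illustration, the training-target CoM described above can be computed from a ground-truth label field with a few lines of NumPy. The function name and the integer class ids (1 = RV, 2 = MYO) are our conventions for this sketch, not the paper's:

```python
import numpy as np

def cardiac_com(label_field, rv=1, myo=2):
    """Training-target CoM: the mean in-plane position of RV and MYO
    voxels (a proxy for the pericardium), averaged over all slices.

    label_field: (D, H, W) integer label map.
    Returns (row, col) in-plane coordinates.
    """
    mask = (label_field == rv) | (label_field == myo)
    coords = np.argwhere(mask)          # (n, 3): (slice, row, col)
    return coords[:, 1:].mean(axis=0)   # average over all slices
```

Note that averaging over all slices makes the estimate robust to a single poorly segmented slice.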
2.1 Shape prior
The shape prior S is a 3D volume which encodes the probability of a 3D location x of being part of a certain class k (Back, LV, RV, or MYO). We estimate this probability by computing the pixel-wise empirical proportion of each class based on the ground-truth label fields L_n of the training dataset:

S_k(x) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}[L_n(x) = k],

where \mathbb{1}[\cdot] is an indicator function which returns 1 when L_n(x) = k and 0 otherwise, and N is the total number of training images.
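A minimal NumPy sketch of this estimate, assuming the label fields have already been realigned to a common space and use integer class ids 0 = Back, 1 = RV, 2 = MYO, 3 = LV (the ids are our convention):

```python
import numpy as np

def shape_prior(labels, num_classes=4):
    """Empirical per-voxel class frequencies over N aligned label fields.

    labels: (N, D, H, W) integer array of class ids.
    Returns a (num_classes, D, H, W) probability volume.
    """
    n = labels.shape[0]
    # For each class k, count how many training label fields assign k
    # to each voxel, then divide by N.
    return np.stack([(labels == k).sum(axis=0) / n
                     for k in range(num_classes)])
```

By construction the probabilities at each voxel sum to 1, which is why the "Back" channel can be dropped from the stored prior.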
These probabilities are stored in a volume whose first dimension indexes the 3 classes (RV, MYO, LV)¹, whose second dimension spans the number of interpolated slices (from the base to the apex), and whose last two dimensions give the in-plane size. Note that, prior to computing S, we realign the CoM of all training label fields into a common space and crop a region around that center.

¹ There is no need to store the probability of "Back" since the 4 probabilities sum up to 1.

2.2 Loss
The goal of our system is to predict a correct label field given an input image X while automatically aligning the shape prior S on X by aligning their centers of mass. In order to do so, our loss incorporates the following four terms:

L = L_s + \lambda_c L_c + \lambda_m L_{CoM} + \lambda_p L_{prior}.

Here L_s and L_c are the cross-entropies of the predicted labels and the predicted contours, i.e. L_s = -\sum_x \sum_k y_k(x) \log \hat{y}_k(x), where k stands for the class index, x is a pixel location, and \lambda_c, \lambda_m, \lambda_p are constants. y_k(x) is the true probability that pixel x is in class k, and \hat{y}_k(x) is the output of our model for pixel x and class k, while the contours used by L_c are extracted from the true and predicted label fields. Note that the use of a contour loss has been shown by Luo et al. [19] to enforce a better precision. As for L_{CoM}, it is the Euclidean distance between the predicted CoM and the true CoM, and L_{prior} is the prior loss.
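The four-term loss can be sketched as follows. The exact weighting scheme and the form of the prior term are not fully specified in this excerpt, so the λ constants and the prior-agreement term below are assumptions made for illustration:

```python
import numpy as np

def total_loss(y_true, y_pred, c_true, c_pred, com_true, com_pred, prior,
               lam_c=1.0, lam_m=1.0, lam_p=1.0):
    """Sketch of the four-term loss: label cross-entropy, contour
    cross-entropy, CoM distance, and a prior term (assumed form).

    y_true / y_pred: (K, H, W) one-hot / softmax label maps.
    c_true / c_pred: (K, H, W) contour maps in the same format.
    com_true / com_pred: length-2 CoM coordinates.
    prior: (K, H, W) shape-prior probabilities, assumed pre-aligned.
    """
    eps = 1e-8
    l_s = -np.sum(y_true * np.log(y_pred + eps))   # label cross-entropy
    l_c = -np.sum(c_true * np.log(c_pred + eps))   # contour cross-entropy
    l_com = np.linalg.norm(np.asarray(com_pred) - np.asarray(com_true))
    l_p = -np.sum(y_pred * np.log(prior + eps))    # assumed prior term
    return l_s + lam_c * l_c + lam_m * l_com + lam_p * l_p
```

A perfect prediction with a correct CoM yields a strictly lower loss than a uniform prediction with a displaced CoM, which is the behaviour the optimizer exploits.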
2.3 Proposed network
The goal of our CNN is to learn good features for predicting the label field as well as the CoM which is used to align the shape prior on the input image; in other words, a network able to minimize the loss of Section 2.2. With that objective, good features must account for both the global and the local context: the global context to differentiate the heart from the surrounding organs and estimate its CoM, and the local context to ensure accurate segmentation and prediction of contours.
In that perspective, we implemented a grid-like CNN with 3 columns and 5 rows (c.f. Fig. 1). The input to our model (upper left) is an MR image and the shape prior of the corresponding slice, while the output is the CoM (bottom right) and a label field (top right) of the same size as the input. Note that a common issue with cardiac MR images is that, along the 2D short axis, the location of the heart sometimes shifts from one slice to another due to different breath-holds during successive acquisitions. As a consequence, instead of processing a 3D volume as a whole, we feed the network with 2D slices as shown in Fig. 1 and rebuild the 3D volume by stacking up the resulting 2D label fields.
As we get deeper in the network (from CONV1 to CONV5), the extracted features involve a larger context of the input image. Since the CONV5 layer includes high-level features from the entire image, we use it to predict the cardiac CoM of the input image. The second column contains 4 convolution layers (all without max-pooling) used to compute features at various resolutions. The last column aggregates features from the lowest to the highest resolution. The UNCONV4 layer contains both global and local features which we use to segment the image. Note that this grid structure is similar to the UNet, except for the middle CONV6 to CONV9 layers and the fact that we use the CONV5 features to estimate a CoM. Each conv layer has a small receptive field and, thanks to zero padding, its feature maps have the same size as their input. We also batch-normalize each feature map, use the ReLU activation function, and apply dropout [20] for better generalization. Also note that, instead of having tens of millions of parameters like the UNet, our grid network has only a few million.

2.3.1 Estimating the center of mass
The CoM is estimated with a regression module located after the CONV5 layer. An average pooling layer (AVG1) is used to reduce the number of features fed to the FC1 layer. The FC1, FC2 and FC3 layers are all fully connected, and the output of FC3 is the 2D in-plane prediction of the CoM.
2.3.2 Estimating the label field
The output of the UNCONV4 layer has 4 feature maps, which we append to the shape prior S. The MERGE1 layer realigns S based on the estimated CoM and uses zero padding to make sure S matches the in-plane size of the feature maps. In this way, the output of MERGE1 has 7 feature maps: 4 from UNCONV4 and 3 from S. The last CONV10 layer squashes those 7 feature maps down to a 4-class output.
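A possible NumPy sketch of the MERGE1 step, i.e. the translation-by-CoM alignment followed by channel-wise concatenation (function and argument names are ours; a real implementation would do this inside the network graph):

```python
import numpy as np

def merge_prior(features, prior, com_pred, com_prior):
    """Translate the shape prior so that its CoM lands on the predicted
    CoM, zero-pad to the feature-map size, then concatenate channels.

    features: (4, H, W) UNCONV4 feature maps.
    prior:    (3, h, w) shape-prior slice (RV, MYO, LV probabilities).
    com_pred / com_prior: (row, col) centers of mass.
    Returns a (7, H, W) array.
    """
    _, H, W = features.shape
    aligned = np.zeros((prior.shape[0], H, W), dtype=prior.dtype)
    dr = int(round(com_pred[0] - com_prior[0]))
    dc = int(round(com_pred[1] - com_prior[1]))
    # Copy each prior pixel to its translated location; out-of-bounds
    # positions are left at zero (the zero padding mentioned above).
    for r in range(prior.shape[1]):
        for c in range(prior.shape[2]):
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W:
                aligned[:, rr, cc] = prior[:, r, c]
    return np.concatenate([features, aligned], axis=0)
```

Because the transform is a pure translation, the alignment remains differentiable-friendly when implemented as a shift inside the network.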
2.3.3 Training
2.3.4 Pre- and post-processing
The input 3D images are preprocessed by clamping the 4% outlying grayscale values, zero-centering the grayscales, and normalizing them by their standard deviation. Once training is over, we remove outliers by keeping the largest connected component for each class in the overall predicted 3D volume. Note that these pre- and post-processes are applied to every model tested in Section 3.
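These pre- and post-processing steps can be sketched as follows. The text does not say how the 4% of outlying values is split, so an even 2% per tail is assumed here:

```python
import numpy as np
from scipy import ndimage

def preprocess(img):
    """Clamp the 4% outlying grey values (assumed 2% at each tail),
    zero-center, and normalize by the standard deviation."""
    lo, hi = np.percentile(img, [2, 98])
    img = np.clip(img.astype(np.float64), lo, hi)
    img = img - img.mean()
    return img / (img.std() + 1e-8)

def keep_largest_component(labels, num_classes=4):
    """Post-process: for each non-background class, keep only the
    largest 3D connected component of the predicted volume."""
    out = np.zeros_like(labels)
    for k in range(1, num_classes):
        comp, n = ndimage.label(labels == k)
        if n == 0:
            continue
        sizes = ndimage.sum(np.ones_like(comp), comp, range(1, n + 1))
        out[comp == (np.argmax(sizes) + 1)] = k
    return out
```

The connected-component filter removes small spurious islands far from the heart, which directly lowers the Hausdorff distance.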
3 Experimental Setup and Results
3.1 Dataset, evaluation criteria, and other methods
Our system was trained and tested on the 2017 ACDC dataset. Since the testing dataset was not available as of the paper submission deadline, we trained our system on 75 exams and validated it on the remaining 25 exams. The exams are divided into 5 evenly distributed groups: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction, abnormal right ventricle and patients without cardiac disease.
Cine MR images were acquired during breath-hold with retrospective or prospective gating and an SSFP sequence in 2-chamber, 4-chamber and short-axis orientations. A series of short-axis slices covers the LV from the base to the apex, with a thickness of 5 to 8 mm and often an inter-slice gap of 5 mm. The spatial resolution goes from 0.83 to 1.75 mm/pixel. For more details on the dataset, please refer to the ACDC website [17].
In order to gauge performances, we report the clinical and geometrical metrics used in the ACDC challenge. The clinical metrics are the correlation coefficients for the cavity volume and the ejection fraction (EF) of the LV and RV, as well as correlation coefficient of the myocardial mass for the End Diastolic (ED) phase. As for the geometrical metrics, we report the Dice coefficient [22] and the Hausdorff distance [23] for all 3 regions and phases.
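For reference, the two geometrical metrics can be implemented directly. The brute-force Hausdorff distance below is meant only to make the definition concrete; it is far too slow for large masks:

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff(a, b):
    """Symmetric Hausdorff distance between the point sets of two
    binary masks (brute force, for illustration only)."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

In practice the distances would be scaled by the voxel spacing to report millimetres, as in Table 1.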
We compared our method with two recent CNN methods: the convdeconv CNN by Noh et al. [12] and the UNet by Ronneberger et al. [13]. We chose those methods based on their excellent segmentation capabilities but also because their architecture can be seen as a particular case of our approach.
Table 1. Dice coefficient, Hausdorff distance (HD) and correlation coefficients for the three methods.

                 Dice LV        Dice RV        Dice MYO
                 ED     ES      ED     ES      ED     ES
Conv-Deconv      0.92   0.87    0.82   0.64    0.76   0.81
UNet             0.96   0.92    0.88   0.79    0.78   0.76
Our method       0.96   0.94    0.94   0.87    0.89   0.90

                 HD LV (mm)     HD RV (mm)     HD MYO (mm)
                 ED     ES      ED     ES      ED     ES
Conv-Deconv      8.77   10.34   22.59  28.45   13.92  11.64
UNet             6.17   8.29    20.51  21.20   15.25  17.92
Our method       5.96   6.57    13.48  16.66   8.68   8.99

                 Corr EF LV   Corr EF RV   Corr MYO ED   Corr LV vol   Corr RV vol
Conv-Deconv      0.988        0.764        0.927         0.990         0.957
UNet             0.991        0.824        0.921         0.995         0.821
Our method       0.992        0.898        0.975         0.997         0.978
3.2 Experimental results
As can be seen in Table 1, our method outperforms both the conv-deconv network and the UNet: it obtains a higher Dice coefficient and a lower Hausdorff distance in every region and phase. This is a strong indication that the contour loss and the shape prior help improve results, especially close to the boundaries. Without much surprise, the RV is the most challenging organ, mostly because of its complicated shape, the partial volume effect close to the free wall, and intensity inhomogeneities. The clinical metrics are also in favor of our method. Based on the correlation coefficients, our method is overall better than conv-deconv and UNet, especially for the myocardium (bottom of Table 1).
Careful inspection reveals that errors are not uniformly distributed. Interestingly, conv-deconv and UNet produce accurate results on most slices of each 3D volume, as illustrated in the first two rows of Fig. 2. That said, they often generate a distorted result for 1 or 2 slices (out of 7 to 17), which ends up decreasing the Dice score and increasing the Hausdorff distance. This situation is shown in rows 3, 4 and 5 of Fig. 2. Overall, the right ventricle is the most challenging region for all three methods. It is especially true at the base of the heart, next to the mitral valve where the RV is connected to the pulmonary artery. This is illustrated in the last row of Fig. 2.

4 Conclusion
We proposed a new CNN method specifically designed for MRI cardiac segmentation. It implements a "grid" architecture which is a generalization of the UNet, and it uses a shape prior which is automatically aligned on the input image with a regression module. The shape prior forces the method to produce anatomically plausible results. Experimental results reveal that our method produces an average Dice score of 0.9. This shows that an approach such as ours can be seen as a decisive step towards a fully-automatic clinical tool for computing functional parameters of the heart. In the future, we plan to generalize our method to other modalities (such as echocardiography or CT) as well as other organs such as the brain. In that case, we will propose a more elaborate registration module implementing an affine transformation.
References
 [1] F. H. Epstein, “MRI of left ventricular function,” J Nucl. Cardiol, vol. 14, no. 5, pp. 729–744, 2007.
 [2] G. W. Vick, “The gold standard for noninvasive imaging in coronary heart disease: magnetic resonance imaging,” Cur. op. in card., vol. 24, no. 6, pp. 567–579, 2009.
 [3] P. Peng et al, “A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging,” MAGMA, vol. 29, no. 2, pp. 155–195, 2016.
 [4] C. Petitjean et al, “Right ventricle segmentation from cardiac MRI: A collation study,” Med. Image Anal., vol. 19, no. 1, pp. 187–202, 2015.
 [5] D. A. Auger et al, “Semiautomated left ventricular segmentation based on a guide point model approach for 3D cine DENSE cardiovascular magnetic resonance,” J Cardiovasc Magn Reson, vol. 16, no. 1, p. 8, 2014.
 [6] D. Grosgeorge, C. Petitjean, J.N. Dacher, and S. Ruan, “Graph cut segmentation with a statistical shape model in cardiac MRI,” CVIU, vol. 117, no. 9, pp. 1027–1035, 2013.
 [7] C. Petitjean and J. Dacher, “A review of segmentation methods in short axis cardiac MR images,” Med. Image Anal., vol. 15, no. 2, pp. 169–184, 2011.
 [8] L. Wang et al, “Left ventricle: Fully automated segmentation based on spatiotemporal continuity and myocardium information in cine cardiac magnetic resonance imaging (LVFAST),” BioMed Res. Int., vol. 2015, 2015.
 [9] Y. Liu and G. Captur et al, “Distance regularized two level sets for segmentation of left and right ventricles from cineMRI,” Mag. res. img., vol. 34, no. 5, pp. 699–706, 2016.
 [10] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in shortaxis MRI,” arXiv preprint arXiv:1604.00494, 2016.
 [11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. of CVPR, pp. 3431–3440, 2015.
 [12] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proc. of ICCV, 2015.
 [13] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. of MICCAI, pp. 234–241, 2015.
 [14] L. K. Tan et al, “Cardiac left ventricle segmentation using convolutional neural network regression,” in Proc. of IECBES, pp. 490–493, IEEE, 2016.
 [15] T. A. Ngo, Z. Lu, and G. Carneiro, “Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance,” Med. Image Anal., vol. 35, no. 1, pp. 159–171, 2017.
 [16] B. Kastler, “Cardiovascular anatomy and atlas of mr normal anatomy,” in MRI of Cardiovascular Malformations, ch. 2, pp. 17–39, Springer, 2011.
 [17] “ACDC-MICCAI challenge.” http://acdc.creatis.insa-lyon.fr/.
 [18] V. Tavakoli and A. A. Amini, “A survey of shapedbased registration and segmentation techniques for cardiac images,” CVIU, vol. 117, no. 9, pp. 966–989, 2013.

 [19] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S.-Z. Li, and P.-M. Jodoin, “Non-local deep features for salient object detection,” in Proc. of CVPR, 2017.
 [20] N. Srivastava, G. Hinton, et al., “Dropout: a simple way to prevent neural networks from overfitting,” J. of Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
 [21] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of ICLR, 2015.
 [22] K. H. Zou et al, “Statistical validation of image segmentation quality based on a spatial overlap index 1: Scientific reports,” Aca. rad., vol. 11, no. 2, pp. 178–189, 2004.
 [23] D. Huttenlocher, G. Klanderman, and W. J. Rucklidge, “Comparing images using the Hausdorff distance,” IEEE Trans. PAMI, vol. 15, no. 9, pp. 850–863, 1993.