MRI is the gold standard modality for cardiac assessment [1, 2]. In particular, the use of kinetic MR images along the short axis orientation of the heart allows accurate evaluation of the function of the left and right ventricles. For these examinations, one has to delineate the left ventricular endocardium (LV), the left ventricular epicardium (or myocardium, MYO) and the right ventricular endocardium (RV) in order to calculate the volume of the cavities in diastole and systole (and thus the ejection fraction), as well as the myocardial mass. These parameters are mandatory to detect and quantify different pathologies.
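As an illustration of how these parameters follow from a delineation, the sketch below computes cavity volumes, the ejection fraction and the myocardial mass from label maps. The label encoding, the function names and the myocardial density of 1.05 g/mL (a commonly used value) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

# Hypothetical label encoding, chosen for this sketch only.
BACK, RV, MYO, LV = 0, 1, 2, 3


def volume_ml(labels, cls, spacing_mm):
    """Volume of one class in millilitres (1 mL = 1000 mm^3)."""
    voxel_mm3 = float(np.prod(spacing_mm))
    return np.count_nonzero(labels == cls) * voxel_mm3 / 1000.0


def ejection_fraction(edv_ml, esv_ml):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (edv_ml - esv_ml) / edv_ml


def myocardial_mass_g(labels, spacing_mm, density_g_per_ml=1.05):
    """Myocardial mass in grams; the density is an assumed constant."""
    return volume_ml(labels, MYO, spacing_mm) * density_g_per_ml
```

Here `spacing_mm` is the (slice, row, column) voxel spacing in millimetres, so the same functions apply to the LV and the RV by changing `cls`.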
As of today, the clinical use of cardiovascular MRI is hampered by the amount of data to be processed (often more than 10 short axis slices and more than 20 phases per slice). Since the manual delineation of all 3D images is clinically impracticable, several semi-automatic methods have been proposed, most of which are based on active contours, dynamic programming, graph cuts or some atlas fitting strategy [4, 5, 3, 6, 7]. Unfortunately, these methods are far from real time because of the manual interaction they require. Also, most of them are ill-suited for segmenting the LV, the RV, and the MYO simultaneously.
Fully-automatic segmentation methods are usually articulated around a machine learning method and, more recently, deep learning (DL) and convolutional neural networks (CNN). Among the best CNN segmentation models are those involving a series of convolution and pooling layers followed by one or several deconvolution layers. In 2015, Ronneberger et al. [13] proposed the U-Net, a CNN which involves connections between the conv and deconv layers and which performs remarkably well on medical images. Recently, DL methods have been proposed to segment cardiac images [10, 14, 15]. While Tran [10] applied the well-known fCNN [11] to MR cardiac images, Tan et al. [14] used a CNN to localize (but not segment) the LV, and Ngo et al. [15] used deep belief nets, again to localize but not segment the LV. These methods perform only one- or two-class segmentation and do not incorporate a shape prior inside the network.
In this paper we propose the first CNN method specifically designed to segment the LV, RV and MYO without a third party segmentation method. Our approach incorporates a shape prior whose registration on the input image is learned by the model.
2 Our Method
The goal of our method is to segment the LV, the RV and the MYO of a 3D raw MR image X. This is done by predicting a 3D label map of the same size as X, whose voxels each contain a label in {Back, LV, RV, MYO}, where “Back” stands for tissues other than these three structures. Following the ACDC structure, X is a series of short axis slices starting from the mitral valve down to the apex (please refer to the ACDC website for more details [17]). In order to enforce a clinically plausible result, a shape prior is provided which encapsulates the relative position of the LV, RV and MYO. The main challenge when using a shape prior such as this one is to align it correctly onto the input data X. Since the size and the orientation of the heart do not vary much from one patient to another, we register the shape prior on X by translating the center of the prior onto the cardiac center of mass (CoM) of X. The CoM is computed based on the location of the pericardium (obtained from MYO and RV) in each slice. Since the CoM is not provided with the input image, our method has a regression module designed to predict it. To our knowledge, our approach is the first to incorporate a shape prior as well as its registration within an end-to-end trainable structure.
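A minimal sketch of how a per-slice CoM could be obtained from a groundtruth label field is given below. It assumes that the region enclosed by the pericardium can be approximated by the union of the RV, MYO and LV labels and takes the centroid of that region; the label encoding and the function name are hypothetical, and the exact procedure used by the authors may differ.

```python
import numpy as np

# Hypothetical label encoding, chosen for this sketch only.
BACK, RV, MYO, LV = 0, 1, 2, 3


def center_of_mass(labels_2d, classes=(RV, MYO, LV)):
    """Centroid (row, col) of the pixels belonging to `classes`, or None if absent."""
    mask = np.isin(labels_2d, classes)
    if not mask.any():
        return None  # slice without any cardiac structure
    rows, cols = np.nonzero(mask)
    return float(rows.mean()), float(cols.mean())
```

Such a centroid would serve as the regression target during training, while at test time the network has to predict it from the image alone.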
2.1 Shape prior
The shape prior is a 3D volume which encodes the probability of a 3D location being part of a certain class (Back, LV, RV, or MYO). We estimate this probability by computing the pixel-wise empirical proportion of each class based on the groundtruth label fields of the training dataset: for a given location and class, it is the sum over all training label fields of an indicator function, divided by the total number of training images. The indicator function returns 1 when the groundtruth label at that location equals the class and 0 otherwise.
These probabilities are put into a volume whose dimensions are the 3 classes (RV, MYO, LV), the number of interpolated slices (from the base to the apex) and the in-plane size. There is no need to store the probability of “Back” since the 4 probabilities sum up to 1. Note that prior to computing the shape prior, we realign the CoM of all training label fields into a common space and crop a region around that center.
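The empirical proportion described above can be sketched in a few lines of NumPy. The example assumes that the training label fields have already been realigned on their CoM, cropped and interpolated to a common shape; the label encoding and the function name are hypothetical.

```python
import numpy as np

# Hypothetical label encoding, chosen for this sketch only.
BACK, RV, MYO, LV = 0, 1, 2, 3


def shape_prior(label_fields, classes=(RV, MYO, LV)):
    """Per-voxel class proportions over aligned label fields.

    label_fields: sequence of integer arrays of identical shape (slices, H, W).
    Returns an array of shape (len(classes), slices, H, W).
    """
    stack = np.stack(label_fields)  # (num_images, slices, H, W)
    # Mean of the indicator (label == class) over the training images.
    return np.stack([(stack == c).mean(axis=0) for c in classes])
```

The probability of “Back” is not stored and can be recovered as one minus the sum over the first axis.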
The goal of our system is to predict a correct label field given an input image while automatically aligning the shape prior on the image by aligning their centers of mass. In order to do so, our loss incorporates the following four terms: the cross-entropy of the predicted labels, the cross-entropy of the predicted contours, the Euclidean distance between the predicted CoM and the true CoM, and a prior loss.
Both cross-entropies are computed over every class index and pixel location, and the factors weighting the terms are constants. The label cross-entropy compares the true probability that a pixel is in a class with the output of our model for that pixel and class, while the contour cross-entropy compares the contours extracted from these two quantities. Note that the use of a contour loss has been shown by Luo et al. [19] to enforce a better precision.
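The structure of such a loss can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the paper: the contour term is approximated by evaluating the label cross-entropy only on groundtruth contour pixels (found with a 4-neighbour test), the prior loss is passed in as a precomputed number, and the weights `w_contour`, `w_com` and `w_prior` are hypothetical names.

```python
import numpy as np


def cross_entropy(p_true, p_pred, mask=None, eps=1e-12):
    """Mean over pixels of -sum_c p_true * log(p_pred); arrays are (C, H, W)."""
    per_pixel = -(p_true * np.log(p_pred + eps)).sum(axis=0)
    if mask is not None:
        if not mask.any():
            return 0.0  # no contour pixel in this slice
        per_pixel = per_pixel[mask]
    return float(per_pixel.mean())


def contour_mask(labels):
    """True where a pixel's label differs from one of its 4 neighbours."""
    edge = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]
    diff_h = labels[:, 1:] != labels[:, :-1]
    edge[1:, :] |= diff_v
    edge[:-1, :] |= diff_v
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h
    return edge


def total_loss(p_true, p_pred, com_true, com_pred, prior_loss=0.0,
               w_contour=1.0, w_com=1.0, w_prior=1.0):
    """Weighted sum of label, contour, CoM and prior terms (assumed form)."""
    labels = p_true.argmax(axis=0)
    l_labels = cross_entropy(p_true, p_pred)
    l_contour = cross_entropy(p_true, p_pred, mask=contour_mask(labels))
    l_com = float(np.linalg.norm(np.asarray(com_pred, dtype=float)
                                 - np.asarray(com_true, dtype=float)))
    return l_labels + w_contour * l_contour + w_com * l_com + w_prior * prior_loss
```

In a trainable setting the same terms would be written with a differentiable framework; NumPy is used here only to make the arithmetic explicit.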
2.3 Proposed network
The goal of our CNN is to learn good features for predicting the label field as well as the CoM, which is used to align the shape prior on the input image. In other words, we seek a network able to minimize the loss of Eq.(2.2). With that objective, good features must account for both the global and the local context: the global context to differentiate the heart from the surrounding organs and estimate its CoM, and the local context to ensure accurate segmentation and prediction of contours.
With this in mind, we implemented a grid-like CNN with 3 columns and 5 rows (cf. Fig. 1). The input to our model (upper left) is an MR image and the shape prior of the corresponding slice, while the output is the CoM (bottom right) and a label field (top right) of the same size as the input image. Note that a common issue with MRI cardiac images is that, along the 2D short axis, the location of the heart sometimes gets shifted from one slice to another due to different breath-holds during successive acquisitions. As a consequence, instead of processing a 3D volume as a whole, we feed the network with 2D slices as shown in Fig. 1 and rebuild the 3D volume by stacking up the resulting 2D label fields.
As we get deeper in the network (from CONV-1 to CONV-5), the extracted features involve a larger context of the input image. Since the CONV-5 layer includes high-level features from the entire image, we use it to predict the cardiac CoM of the input image. The second column contains 4 convolution layers (all without max-pooling) used to compute features at various resolutions. The last column aggregates features from the lowest to the highest resolution. The UNCONV-4 layer contains both global and local features which we use to segment the image. Note that this grid structure is similar to the U-Net except for the middle CONV-6 to 9 layers and the fact that we use the CONV-5 features to estimate a CoM. Each conv layer uses dropout [20] for better generalization. Please note that the parameter count of our gridNet (in the millions) differs from that of the U-Net (also in the millions).
2.3.1 Estimating the center of mass
The CoM is estimated with a regression module located after the CONV-5 layer. An average pooling layer (AVG-1) is used to reduce the number of features fed to the FC-1 layer. The FC-1, FC-2 and FC-3 layers are all fully connected and the output of FC-3 is the prediction of the CoM.
2.3.2 Estimating the label field
The output of the UNCONV-4 layer has 4 feature maps which we append to the shape prior. The MERGE-1 layer realigns the shape prior based on the estimated CoM and uses zero padding to make sure the prior has the same in-plane size as the UNCONV-4 feature maps. In this way, the output of MERGE-1 has 7 feature maps: 4 from UNCONV-4 and 3 from the shape prior. The last CONV-10 layer is used to squash those 7 feature maps down to a 4D output (one channel per class).
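The realignment and concatenation performed by a layer such as MERGE-1 can be sketched as below. This is only an illustration under stated assumptions: the prior slice is translated by an integer offset so that its center lands on the estimated CoM, anything falling outside the image is discarded, and the rest is zero padded. The function names `align_prior` and `merge` are hypothetical.

```python
import numpy as np


def align_prior(prior, com, out_hw):
    """Place the center of `prior` (C, h, w) at `com` (row, col) in a
    zero-padded array of shape (C, H, W), with (H, W) = out_hw."""
    channels, h, w = prior.shape
    height, width = out_hw
    out = np.zeros((channels, height, width), dtype=prior.dtype)
    top = int(round(com[0])) - h // 2
    left = int(round(com[1])) - w // 2
    # Destination window, clipped to the image bounds.
    r0, r1 = max(top, 0), min(top + h, height)
    c0, c1 = max(left, 0), min(left + w, width)
    if r0 < r1 and c0 < c1:
        out[:, r0:r1, c0:c1] = prior[:, r0 - top:r1 - top, c0 - left:c1 - left]
    return out


def merge(features, prior, com):
    """Concatenate feature maps (F, H, W) with the realigned prior (C, H, W)."""
    aligned = align_prior(prior, com, features.shape[1:])
    return np.concatenate([features, aligned], axis=0)
```

With 4 feature maps and a 3-class prior, `merge` returns 7 channels, matching the count given in the text.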
2.3.4 Pre and Post processing
The input 3D images are pre-processed by clamping the 4% outlying grayscale values, zero-centering the grayscales and normalizing them by their standard deviation. Once training is over, we remove outliers from the predictions by keeping the largest connected component for each class on the overall predicted 3D volume. Note that these pre- and post-processing steps are applied to every model tested in Section 3.
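Both steps can be sketched with NumPy alone. Two points are assumptions made for this example: the 4% of outlying values are split evenly between the two tails of the intensity distribution (clipping at the 2nd and 98th percentiles), and connected components use 6-connectivity in 3D. Function names are hypothetical.

```python
from collections import deque

import numpy as np

NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def preprocess(volume, outlier_pct=4.0):
    """Clip outlying intensities, then zero-center and scale to unit std."""
    lo, hi = np.percentile(volume, [outlier_pct / 2, 100 - outlier_pct / 2])
    clipped = np.clip(volume.astype(np.float64), lo, hi)
    centered = clipped - clipped.mean()
    std = clipped.std()
    return centered / std if std > 0 else centered


def largest_component(mask):
    """Keep only the largest 6-connected component of a 3D boolean mask."""
    seen = np.zeros(mask.shape, dtype=bool)
    best = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        queue, comp = deque([start]), []
        while queue:  # breadth-first flood fill
            z, y, x = queue.popleft()
            comp.append((z, y, x))
            for dz, dy, dx in NEIGHBOURS:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < mask.shape[i] for i in range(3))
                if inside and mask[n] and not seen[n]:
                    seen[n] = True
                    queue.append(n)
        if len(comp) > len(best):
            best = comp
    out = np.zeros(mask.shape, dtype=bool)
    if best:
        out[tuple(np.array(best).T)] = True
    return out


def postprocess(labels, classes=(1, 2, 3)):
    """Keep the largest connected component of each class; the rest becomes 0."""
    out = np.zeros_like(labels)
    for c in classes:
        out[largest_component(labels == c)] = c
    return out
```

The flood fill is written for clarity rather than speed; a dedicated image-processing library would typically be preferred on full-size volumes.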
3 Experimental Setup and Results
3.1 Dataset, evaluation criteria, and other methods
Our system was trained and tested on the 2017 ACDC dataset. Since the testing dataset was not available as of the paper submission deadline, we trained our system on 75 exams and validated it on the remaining 25 exams. The exams are divided into 5 evenly distributed groups: dilated cardiomyopathy, hypertrophic cardiomyopathy, myocardial infarction with altered left ventricular ejection fraction, abnormal right ventricle and patients without cardiac disease.
Cine MR images were acquired in breath hold with a retrospective or prospective gating and with an SSFP sequence in 2-chamber, 4-chamber and short axis orientations. A series of short axis slices covers the LV from the base to the apex, with a thickness of 5 to 8 mm and often an interslice gap of 5 mm. The spatial resolution ranges from 0.83 to 1.75 mm/pixel. For more details on the dataset, please refer to the ACDC website [17].
In order to gauge performance, we report the clinical and geometrical metrics used in the ACDC challenge. The clinical metrics are the correlation coefficients for the cavity volume and the ejection fraction (EF) of the LV and RV, as well as the correlation coefficient of the myocardial mass for the End Diastolic (ED) phase. As for the geometrical metrics, we report the Dice coefficient [22] and the Hausdorff distance [23] for all 3 regions and phases.
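For reference, the two geometrical metrics can be sketched as follows for a pair of 2D binary masks. As a simplification assumed here, the Hausdorff distance is computed by brute force over all foreground pixels rather than over extracted contours, and both masks must be non-empty; the ACDC evaluation code may differ in these details.

```python
import numpy as np


def dice(a, b):
    """Dice coefficient 2|A and B| / (|A| + |B|) of two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0


def hausdorff(a, b, spacing=(1.0, 1.0)):
    """Symmetric Hausdorff distance between the foreground pixels of two
    non-empty masks, in the units of `spacing` (e.g. mm per pixel)."""
    pa = np.argwhere(a) * np.asarray(spacing)
    pb = np.argwhere(b) * np.asarray(spacing)
    # Pairwise distances between every point of A and every point of B.
    d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

Passing the pixel spacing of each exam converts the Hausdorff distance from pixels to millimetres.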
We compared our method with two recent CNN methods: the conv-deconv CNN by Noh et al. [12] and the U-Net by Ronneberger et al. [13]. We chose those methods based on their excellent segmentation capabilities but also because their architecture can be seen as a particular case of our approach.
Table 1: Geometrical metrics (Dice and Hausdorff distance (HD, in mm) for the LV, RV and MYO) and clinical metrics (correlation coefficients for the EF of the LV and RV, the myocardial mass at ED, and the LV and RV volumes).
3.2 Experimental results
As can be seen in Table 1, our method outperforms both the conv-deconv and the U-Net. On average, the Dice coefficient is higher and the Hausdorff distance is lower for our method. This is a strong indication that the contour loss and the shape prior help improve results, especially close to the boundaries. Unsurprisingly, the RV is the most challenging structure, mostly because of its complicated shape, the partial volume effect close to the free wall, and intensity inhomogeneities. The clinical metrics are also in favor of our method. Based on the correlation coefficient, our method is overall better than conv-deconv and U-Net, especially for the myocardium (bottom of Table 1).
Careful inspection reveals that errors are not uniformly distributed. Interestingly, conv-deconv and U-Net produce accurate results on most slices of each 3D volume, as illustrated in the first two rows of Fig. 2. That said, they often generate a distorted result for 1 or 2 slices (out of 7 to 17), which ends up decreasing the Dice score and increasing the Hausdorff distance. This situation is shown in rows 3, 4, and 5 of Fig. 2. Overall, the right ventricle is the most challenging region for all three methods. This is especially true at the base of the heart, next to the mitral valve, where the RV is connected to the pulmonary artery. This is illustrated in the last row of Fig. 2.
We proposed a new CNN method specifically designed for MRI cardiac segmentation. It implements a “grid” architecture which is a generalization of the U-Net. It uses a shape prior which is automatically aligned on the input image with a regression method. The shape prior forces the method to produce anatomically plausible results. Experimental results reveal that our method produces an average Dice score of 0.9. This suggests that an approach such as ours is a step towards a fully-automatic clinical tool for computing functional parameters of the heart. In the future, we plan on generalizing our method to other modalities (such as echocardiography or CT-scan) as well as to other organs such as the brain. In that case, we shall propose a more elaborate registration module implementing an affine transformation.
- [1] F. H. Epstein, “MRI of left ventricular function,” J Nucl. Cardiol, vol. 14, no. 5, pp. 729–744, 2007.
- [2] G. W. Vick, “The gold standard for noninvasive imaging in coronary heart disease: magnetic resonance imaging,” Cur. op. in card., vol. 24, no. 6, pp. 567–579, 2009.
- [3] P. Peng et al, “A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging,” MAGMA, vol. 29, no. 2, pp. 155–195, 2016.
- [4] C. Petitjean et al, “Right ventricle segmentation from cardiac MRI: A collation study,” Med. Image Anal., vol. 19, no. 1, pp. 187–202, 2015.
- [5] D. A. Auger et al, “Semi-automated left ventricular segmentation based on a guide point model approach for 3D cine DENSE cardiovascular magnetic resonance,” J Cardiovasc Magn Reson, vol. 16, no. 1, p. 8, 2014.
- [6] D. Grosgeorge, C. Petitjean, J.-N. Dacher, and S. Ruan, “Graph cut segmentation with a statistical shape model in cardiac MRI,” CVIU, vol. 117, no. 9, pp. 1027–1035, 2013.
- [7] C. Petitjean and J. Dacher, “A review of segmentation methods in short axis cardiac MR images,” Med. Image Anal., vol. 15, no. 2, pp. 169–184, 2011.
- [8] L. Wang et al, “Left ventricle: Fully automated segmentation based on spatiotemporal continuity and myocardium information in cine cardiac magnetic resonance imaging (LV-FAST),” BioMed Res. Int., vol. 2015, 2015.
- [9] Y. Liu and G. Captur et al, “Distance regularized two level sets for segmentation of left and right ventricles from cine-MRI,” Mag. res. img., vol. 34, no. 5, pp. 699–706, 2016.
- [10] P. V. Tran, “A fully convolutional neural network for cardiac segmentation in short-axis MRI,” arXiv preprint arXiv:1604.00494, 2016.
- [11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. of CVPR, pp. 3431–3440, 2015.
- [12] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proc. of ICCV, 2015.
- [13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Proc. of MICCAI, pp. 234–241, 2015.
- [14] L. K. Tan et al, “Cardiac left ventricle segmentation using convolutional neural network regression,” in Proc. of IECBES, pp. 490–493, IEEE, 2016.
- [15] T. A. Ngo, Z. Lu, and G. Carneiro, “Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance,” Med. Image Anal., vol. 35, no. 1, pp. 159–171, 2017.
- [16] B. Kastler, “Cardiovascular anatomy and atlas of MR normal anatomy,” in MRI of Cardiovascular Malformations, ch. 2, pp. 17–39, Springer, 2011.
- [17] “ACDC-MICCAI challenge.” http://acdc.creatis.insa-lyon.fr/.
- [18] V. Tavakoli and A. A. Amini, “A survey of shaped-based registration and segmentation techniques for cardiac images,” CVIU, vol. 117, no. 9, pp. 966–989, 2013.
- [19] Z. Luo, A. Mishra, A. Achkar, J. Eichel, S.-Z. Li, and P.-M. Jodoin, “Non-local deep features for salient object detection,” in Proc. of CVPR, 2017.
- [20] N. Srivastava and G. Hinton et al, “Dropout: a simple way to prevent neural networks from overfitting,” J. of Mach. Learn. Res., vol. 15, pp. 1929–1958, 2014.
- [21] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. of ICLR, 2015.
- [22] K. H. Zou et al, “Statistical validation of image segmentation quality based on a spatial overlap index 1: Scientific reports,” Aca. rad., vol. 11, no. 2, pp. 178–189, 2004.
- [23] D. Huttenlocher, G. Klanderman, and W. J. Rucklidge, “Comparing images using the Hausdorff distance,” IEEE trans PAMI, vol. 15, no. 9, pp. 850–863, 1993.