Cardiovascular disease is one of the major causes of death in the world. Physicians use imaging technologies such as magnetic resonance imaging (MRI) to estimate structural (e.g. volume) and functional (e.g. ejection fraction) cardiac parameters for both diagnosis and disease management. Fully automated estimation of such parameters can facilitate early diagnosis of the disease and has the potential to remove the more mechanistic aspects of a radiologist's assessment. As such, there has recently been increasing interest in machine learning algorithms for fully automatic left-ventricle (LV) segmentation [1, 17, 8, 10, 12]. This is a challenging task due to the variability of LV shape across slices, cardiac phases, patients and scanning machines, as well as the weak boundaries of the LV caused by blood flow, papillary muscles and trabeculations. A review of LV segmentation methods in short-axis cardiac MR images is given by Petitjean and Dacher.
The main image analysis approaches to LV segmentation can be grouped into three broad categories: active contour models, machine learning models, and hybrid versions that combine elements of the two approaches. Active contour models with either explicit or implicit contour representations minimize an energy function composed of internal and external constraints. The internal constraints represent continuity and smoothness of the contour, and the external constraints represent appearance and shape of the target object. However, designing appropriate energy functions that can handle all sources of variability is challenging. Also, the quality of the segmentations produced by these methods typically depends on the region-of-interest (ROI) used to initialise the algorithms. Machine learning approaches have been proposed to circumvent some of these issues [9, 18, 17, 1] at the expense of collecting large training datasets with a sufficient number of examples. Investigating hybrid methods that combine some elements of both approaches is an active research area.
Typically, these solutions consist of three distinct stages carried out sequentially. Initially, the LV is localised within each two-dimensional slice; then the LV is segmented; finally, the segmentation is further refined to improve its quality. For instance, a pipeline consisting of Deep Belief Networks (DBNs) for both localisation and segmentation, followed by a level-set methodology, has been shown to generate high-quality segmentations. In more recent work, a different pipeline has been proposed that consists of convolutional neural networks for initial LV detection, followed by a segmentation step deploying stacked autoencoders, and a fine-tuning strategy also based on level-set methodology. The latter approach has been shown to produce state-of-the-art results on the MICCAI 2009 LV segmentation challenge. Both approaches share a number of common features. First, the segmentation is carried out using two-dimensional patches that are independently extracted from each MRI slice. Second, they use separate architectures for the two tasks, localization and segmentation. Third, different neural network architectures are trained for cardiac MR slices containing the base and apex of the heart, due to the observed heterogeneity in local shape variability.
In this work we investigate a neural network architecture, trained end-to-end, that learns to detect and segment the LV jointly from the entire stack of short-axis images rather than operating on individual slices. Recently, fully convolutional networks (FCNs) have been proposed for the segmentation of 2D images. They take arbitrarily sized input images, and use feature pooling coupled with an upsampling step to produce outputs of the same size that deliver the segmentation. Compared to more traditional sliding-window approaches, FCNs are more efficient. They have received increasing interest lately as they unify object localization and segmentation in a single process by extracting both global and local context effectively [16, 22]. Applications of FCNs to medical imaging segmentation problems have also started to appear, for instance for the identification of neuronal structures in electron microscopic recordings. In independent work, Valipour et al. have recently adapted recurrent fully convolutional networks for video segmentation.
Here we propose an extension of FCNs, called Recurrent Fully-Convolutional Networks (RFCN), to directly address the segmentation problem in multi-slice MR images. We are motivated by the desire to exploit the spatial dependences that are observed across adjacent slices and learn image features that capture the global anatomical structure of the heart from the full image stack. We investigate whether exploiting this information is beneficial for accurate anatomical segmentation, especially for cardiac regions with weak boundaries, e.g. poor structural contrast due to the presence of blood flow, papillary muscles and trabeculations.
Our experiments are based on two independent datasets consisting of short-axis cardiac MR images for which the endocardium has been manually segmented by expert radiologists in each axial slice. Further details are provided below.
2.1 MICCAI dataset
The MICCAI 2009 LV Segmentation Challenge  dataset was made publicly available by the Sunnybrook Health Sciences Center (Canada) and has been extensively used to compare a number of LV segmentation algorithms [12, 10, 8, 18, 17, 1]. It consists of CINE MRI images from a number of different pathologies. The individual exams have been pre-grouped into training, validation and online testing subsets. Each subset contains cases of which heart failure with infarction (HF-I), heart failure without infarction (HF), LV hypertrophy (HYP) and healthy subjects. However, the clinical information has not been used by any of the algorithms discussed here and in our experiments. All the images were obtained during breath-hold sessions lasting seconds with a temporal resolution of cardiac phases over the heart cycle. A typical phase, end diastole (ED) or end systole (ES), contains short-axis slices obtained from the base to apex. In all the images, the slice thickness is mm, the inter-slice gap is mm, the field of view is mm mm and the pixel size is . In all samples, LV endocardial contours were drawn by an experienced cardiologist by taking 2D slices at both the end-systolic and end-diastolic phases, and then independently confirmed by a second reader. The manual segmentations were used as ground truth for the evaluation of the proposed models. Each set consists of sequences ( samples for each one of the two cardiac phases) with an average sequence length slices.
2.2 PRETERM dataset
A second and larger dataset was used for an independent evaluation of all the cardiac segmentation algorithms. The dataset consists of subjects used to study perinatal factors modifying the left ventricular parameter . All the individuals are between to years of age. Of these, were followed prospectively since preterm birth, and are characterised by an average gestational age of weeks and a birth weight of kg. The remaining subjects were born at term to uncomplicated pregnancies. Short-axis CINE MRI stacks were acquired with a -T Siemens Sonata scanner. All images have a mm slice thickness and mm inter-slice gap, the in-plane resolution is mm (min. , max ). All cardiovascular MRI was prospectively ECG gated and acquired during end-expiration breath holding. LV slices and endocardial masks were resampled into a homogeneous in-plane resolution of 2mm, which yield slice pixel size of . Left ventricular short-axis endocardial borders were manually contoured by an expert reader at ES and ED using Siemens analytic software (Argus, Siemens Medical Solutions, Germany). The dataset was randomly divided into training, validation and testing sets of sizes and , respectively.
3 Recurrent fully-convolutional networks
The proposed recurrent fully-convolutional network (RFCN) is an extension of the FCN architecture originally introduced by Long et al. for predicting pixel-wise, dense outputs from arbitrarily-sized inputs. The main idea underlying FCNs is to extend a contracting path, in which a sequence of pooling operators progressively reduces the spatial size of the feature maps, by adding successive layers where pooling operators are replaced by upsampling operators. In this respect, our architecture is similar to U-net (Ronneberger et al.), where the expanding path is characterised by a large number of feature channels allowing the network to propagate context information to higher resolution layers.
Our purpose is to model the full stack of short-axis images extracted from cardiac MRI and improve the segmentation of the left ventricle in each slice by leveraging inter-slice spatial dependences. The input is the entire sequence of slices obtained at a particular cardiac phase (ED or ES) and the output is the sequence of corresponding (manually produced) left-ventricular masks. Each input and output image is assumed to have equal size. A schematic illustration is given in Figure 1. As can be seen there, slices around the base of the heart (at the top) cover larger LV regions and show relatively clear boundaries whereas slices around the apex (at the bottom) cover smaller LV regions and present more blurred boundaries. Learning the typical shape deformations that are observed as we move from the base towards the apex is expected to improve the overall quality of the segmentation in challenging regions around the apex.
Three main building blocks characterise the proposed RFCN, as illustrated in Figure 2: a feature-extraction (contracting) path, a global-feature component and an upsampling (expanding) path. The feature-extraction component, which is independently applied to each image in the stack, deploys successive convolution and max-pooling operations to learn higher level features and remove local redundancy. In our architecture, this component consists of a repeated block of two convolutional layers followed by a max pooling layer. We doubled the number of feature channels after each max pooling layer to maintain enough context, i.e. each block reduces the spatial size of its input while the number of feature maps doubles.
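As a rough illustration of the bookkeeping in the contracting path, the sketch below tracks how feature-map shapes evolve when each pooling step shrinks the spatial size and the channel count doubles afterwards. The pooling factor of 2, the input size, the starting channel count and the function name `contracting_shapes` are all assumptions chosen for illustration; they are not values taken from the architecture described here.

```python
def contracting_shapes(height, width, channels, num_blocks, pool=2):
    """Return the (height, width, channels) shape at each level of the path.

    Assumes the convolutions preserve spatial size, so that only the
    max-pooling step (hypothetical factor `pool`) shrinks the maps, and that
    the channel count doubles after every pooling step.
    """
    shapes = [(height, width, channels)]
    for _ in range(num_blocks):
        height, width = height // pool, width // pool
        channels *= 2
        shapes.append((height, width, channels))
    return shapes


if __name__ == "__main__":
    # Hypothetical input: a 256 x 256 slice with 32 feature channels.
    for level, shape in enumerate(contracting_shapes(256, 256, 32, num_blocks=3)):
        print(level, shape)
```

The same function, read in reverse, describes the expanding path, where each upsampling step enlarges the maps again.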
At the end of this contracting path the network has extracted the most compressed features carrying global context. The global-feature component starts here with a convolutional layer followed by a ReLU. We denote the output of this layer by $o_s$, where $s$ indicates the slice index, i.e. $s \in \{1, \dots, S\}$. This output is a set of feature maps. In an attempt to extract global features that capture the spatial changes observed when moving from the base to the apex of the heart, we introduce a recurrent mechanism mapping $o_s$ into a new set of features, $h_s = \phi(h_{s-1}, o_s)$, where $\phi(\cdot)$ is a non-linear function, and the size of $h_s$ is the same as the size of $o_s$. Another convolutional layer followed by a ReLU is then applied to complete the global-feature extraction block. Given that training recurrent architectures is particularly difficult due to the well-documented vanishing gradient problem, several options were considered for the implementation of the recurrent function $\phi(\cdot)$, including a Long Short-Term Memory (LSTM) module and a Gated Recurrent Unit (GRU) module. GRUs in particular have been shown to achieve performance comparable to LSTMs on a number of tasks involving sequential data whilst requiring fewer parameters and less memory. Here we have chosen to use a convolutional variant of the GRU so that the local correlations of the input images are preserved whilst achieving a notable reduction in the number of parameters compared to its non-convolutional counterpart.
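To make the recurrence concrete, the following is a minimal pure-Python sketch of a convolutional GRU applied to a stack of single-channel feature maps ordered from base to apex. It uses the standard GRU gating equations with matrix products replaced by 2D convolutions. The single-channel simplification, the parameter dictionary keys (`Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh`, `bz`, `br`, `bh`) and the helper names are assumptions made for illustration, and the update rule `(1 - z) * h_prev + z * candidate` is one of two equivalent conventions found in the literature. It should not be read as the implementation used in this work.

```python
import math


def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2  # kernel is assumed square with an odd side length
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < rows and 0 <= jj < cols:
                        out[i][j] += image[ii][jj] * kernel[di + half][dj + half]
    return out


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def combine(fn, *maps):
    """Apply fn element-wise across equally sized 2D maps."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*maps)]


def conv_gru_step(x, h_prev, p):
    """One convolutional GRU update for the feature map x of a single slice."""
    # Update gate: how much of the candidate state replaces the previous state.
    z = combine(lambda a, b: sigmoid(a + b + p["bz"]),
                conv2d_same(x, p["Wz"]), conv2d_same(h_prev, p["Uz"]))
    # Reset gate: how much of the previous state feeds the candidate.
    r = combine(lambda a, b: sigmoid(a + b + p["br"]),
                conv2d_same(x, p["Wr"]), conv2d_same(h_prev, p["Ur"]))
    gated = combine(lambda a, b: a * b, r, h_prev)
    candidate = combine(lambda a, b: math.tanh(a + b + p["bh"]),
                        conv2d_same(x, p["Wh"]), conv2d_same(gated, p["Uh"]))
    return combine(lambda zz, hp, c: (1.0 - zz) * hp + zz * c, z, h_prev, candidate)


def run_over_slices(slices, h0, p):
    """Propagate the hidden state from the first slice (base) to the last (apex)."""
    states, h = [], h0
    for x in slices:
        h = conv_gru_step(x, h, p)
        states.append(h)
    return states
```

Because every gate is computed with a small shared kernel rather than a dense weight matrix, the number of parameters does not grow with the size of the feature maps, which is the parameter saving the text refers to.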
For every slice, the dense feature maps that have been learned by the convolutional GRU module are then upsampled to compensate for the input size reduction caused by the max-pooling operations. The upsampled features are concatenated with a high resolution parallel layer aligned to the feature-extraction component, similarly to the U-net architecture. Our upsampling component consists of a repeated block of a convolutional layer with a fractional stride, a feature map concatenation module and two convolutional layers (with stride of 1) followed by ReLU. The feature map concatenation module combines the outputs of the upsampling layer and the parallel feature-extraction block. Each block of the upsampling component takes a three-dimensional tensor as input and outputs a three-dimensional tensor. A convolutional operation with fractional stride is employed to compensate for the reduction in input size due to the max pooling operation. Even though the upsampling procedure smooths out the boundaries of the object to be segmented, the concatenation of upsampled feature maps with high-resolution feature maps helps mitigate this smoothing problem by providing better local and boundary information. The final segmentation is obtained by using a convolutional layer, which maps the output of the upsampling component onto the two classes, i.e. LV and background. The probability for each class is given by a softmax function across all pixel locations.
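The final classification step can be sketched as a softmax over the two class scores at every pixel, followed by picking the more probable class. The layout `scores[class][row][col]`, the choice of class index 1 for the LV and the function names are assumptions for illustration only.

```python
import math


def pixelwise_softmax(scores):
    """Turn scores[c][i][j] into probabilities normalised over classes c."""
    num_classes = len(scores)
    rows, cols = len(scores[0]), len(scores[0][0])
    probs = [[[0.0] * cols for _ in range(rows)] for _ in range(num_classes)]
    for i in range(rows):
        for j in range(cols):
            column = [scores[c][i][j] for c in range(num_classes)]
            top = max(column)  # subtract the maximum for numerical stability
            exps = [math.exp(v - top) for v in column]
            total = sum(exps)
            for c in range(num_classes):
                probs[c][i][j] = exps[c] / total
    return probs


def predict_mask(probs, lv_class=1):
    """Label a pixel 1 when the LV class has the highest probability, else 0."""
    rows, cols = len(probs[0]), len(probs[0][0])
    classes = range(len(probs))
    return [[int(max(classes, key=lambda c: probs[c][i][j]) == lv_class)
             for j in range(cols)] for i in range(rows)]


if __name__ == "__main__":
    background = [[0.0, 0.0]]
    lv = [[-1.0, 2.0]]
    print(predict_mask(pixelwise_softmax([background, lv])))
```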
4 Other architectures and model training
In the DBN-based pipeline introduced earlier, a DBN was first used to detect the region of interest containing the LV. Anatomical segmentation was then carried out using distance-regularised level sets, which were modified to leverage prior shape information inferred by a separate DBN. In these models, as in FCNs, each slice in the short-axis stack is segmented independently of all the others. The main building block of a DBN model is a restricted Boltzmann machine (RBM), typically trained using the contrastive divergence algorithm. In some of our experiments, we have assessed the performance of DBNs for LV segmentation in comparison with FCNs and the proposed RFCNs.
In order to further investigate whether modelling the dependence across slices typically yields improved performance, and motivated by the existing body of work on DBNs, we have also assessed the performance of a recurrent version of restricted Boltzmann machines (RRBM), originally proposed to learn human body motion but never used for LV segmentation. RRBMs are stacked together to form what we call a recurrent deep belief network (RDBN). Similarly to the proposed RFCN, RDBN takes the entire sequence of short-axis slices as input and leverages the spatial correlations through additional bias units. For further information, we refer the reader to the original work.
The two convolutional architectures, FCN and RFCN, were trained by minimizing the cross-entropy objective function. FCN was trained using stochastic gradient descent with momentum, whereas RFCN was trained using stochastic gradient descent with RMSProp. Back-propagation was used to compute the gradient of the cross-entropy objective function with respect to all parameters of the model, including the GRU component in the case of RFCN. We also learned the initial hidden state of the GRU, as required by the first slice of the sequence. In each block, batch normalization was added after each convolutional layer, i.e. just before the max-pooling and upsampling layers. All reported results refer to the best of repeated experiments in which the models were initialised with random parameters. RFCN was initialised using weights obtained from FCN, which reduced the training time and provided good initial weights. Both the DBN and RDBN architectures were trained using the contrastive divergence algorithm. Dropout was found to improve their overall performance. For all these models, best results were achieved using a learning rate with constant decay after each epoch, together with momentum and weight decay. During training, both the MICCAI and PRETERM datasets were augmented with additional artificial training images, generated by translation and rotation, to prevent model overfitting; this was found to yield better performance.
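For a two-class problem such as LV versus background, the cross-entropy objective reduces to a binary cross-entropy between the predicted LV probability and the manual mask at each pixel. The sketch below is a minimal version of that loss; averaging over pixels, the clipping constant `eps` and the function name are assumptions, since the text does not specify how the per-pixel terms are aggregated.

```python
import math


def pixelwise_cross_entropy(lv_probs, mask, eps=1e-12):
    """Mean binary cross-entropy between LV probabilities and a 0/1 mask."""
    total, count = 0.0, 0
    for prob_row, mask_row in zip(lv_probs, mask):
        for p, y in zip(prob_row, mask_row):
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


if __name__ == "__main__":
    predicted = [[0.9, 0.2], [0.6, 0.1]]
    manual = [[1, 0], [1, 0]]
    print(pixelwise_cross_entropy(predicted, manual))
```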
5 Experimental results
This section presents an empirical evaluation of several LV endocardium segmentation algorithms using three performance metrics: good contours (GC), Dice index, and average perpendicular distance (APD) between manually drawn and predicted contours. In order to make our experimental results comparable with published studies on the MICCAI dataset, all models were validated using the online set, and we report on results obtained on the validation set. Table 1 summarises the experimental results. On the MICCAI dataset, Table 1 includes the Dice index, GC and APD previously reported for the DBN-based approach and for the multi-stage pipeline based on convolutional networks and stacked autoencoders (before further post-processing). A comparable Dice index is obtained by both FCN and RFCN, which yield higher GC and smaller APD. Here RFCN outperforms FCN with an improved APD of 2.05 mm.
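The three metrics can be sketched as follows. The Dice index is computed from binary masks. The APD is approximated here by the mean distance from each predicted contour point to its nearest manual contour point, which is a simplification of a true perpendicular distance. The 5 mm threshold used for good contours is an assumption based on the usual convention of the challenge evaluation and is not stated in this text; all function and parameter names are illustrative.

```python
import math


def dice_index(pred, truth):
    """Dice = 2|A and B| / (|A| + |B|) for two binary masks (nested 0/1 lists)."""
    inter = size_pred = size_truth = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += p * t
            size_pred += p
            size_truth += t
    if size_pred + size_truth == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * inter / (size_pred + size_truth)


def approx_apd(pred_contour, truth_contour, pixel_mm=1.0):
    """Mean nearest-point distance (in mm) from predicted to manual contour."""
    dists = [min(math.hypot(px - tx, py - ty) for tx, ty in truth_contour)
             for px, py in pred_contour]
    return pixel_mm * sum(dists) / len(dists)


def good_contour_rate(apd_values, threshold_mm=5.0):
    """Percentage of contours whose APD is below the (assumed) threshold."""
    good = sum(1 for d in apd_values if d < threshold_mm)
    return 100.0 * good / len(apd_values)
```

A higher Dice index and GC indicate better agreement with the manual contours, whereas a lower APD is better.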
The PRETERM dataset was modelled using the same architectures, without further customisation. The results of this application are also summarised in Table 1. For this dataset, we compared the performance of four different architectures: FCN, RFCN, DBN and RDBN. The latter two models were given as input a region of interest containing the LV, thus conferring on them an advantage compared to FCN and RFCN. On this dataset we were not able to test the recently proposed pipeline that relies on multiple stages. As in the MICCAI dataset, the fully convolutional architectures have achieved superior performance. RFCN has outperformed all other architectures in terms of Dice index and APD, which was found to be as small as 1.56 mm. In comparison, DBN with known LV location yields an APD of 2.05 mm. RDBN yields higher GC and lower APD compared to DBN, thus providing additional evidence that performance gains can be obtained by modelling inter-slice dependences.
| Dataset | Method | GC (%) | Dice | APD (mm) |
|---|---|---|---|---|
| MICCAI with level-sets | | 95.91 (5.28) | 0.880 (0.03) | 2.34 (0.46) |
| | | 97.80 (4.70) | 0.94 (0.02) | 1.70 (0.37) |
| | | 95.91 (5.28) | 0.880 (0.03) | 2.34 (0.46) |
| MICCAI without level-sets | | 95.71 (6.96) | 0.880 (0.03) | 2.34 (0.45) |
| | | 90.00 (10.00) | 0.900 (0.10) | 2.84 (0.29) |
| | | 90.29 (12.73) | 0.880 (0.03) | 2.42 (0.36) |
| | FCN | 94.78 (6.27) | 0.902 (0.04) | 2.14 (0.38) |
| | RFCN | 95.34 (7.20) | 0.900 (0.04) | 2.05 (0.29) |
| PRETERM | DBN | 92.01 (8.36) | 0.913 (0.02) | 2.05 (0.38) |
| | RDBN | 97.50 (6.77) | 0.909 (0.02) | 1.94 (0.23) |
| | FCN | 97.59 (4.82) | 0.916 (0.03) | 1.80 (0.41) |
| | RFCN | 95.37 (5.69) | 0.935 (0.03) | 1.56 (0.31) |
To gain insight into the regional improvements introduced by RFCN, the Dice index was computed separately for different local regions of the LV, and the results are summarised in Table 2. Here, each Base class is defined by the number of slices taken starting from the base of the heart and moving towards the middle, and analogously for the apex. All the remaining slices contribute to the Central class. In all cases, the Dice index is calculated using all the samples at once to reflect overall pixel accuracy. In both datasets, RFCN outperforms FCN around the central slices and around the apex, as expected. However, in the MICCAI dataset, FCN yields better performance around the base of the heart. On the PRETERM dataset, both DBN and RDBN performed worse than FCN and RFCN, despite using focused regions of interest instead of full-sized images. Here again it can be observed that RDBN improves upon DBN across all cardiac locations.
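The regional analysis can be sketched in two steps: assign slice indices to base, central and apex groups, then compute a single Dice index over all pixels pooled from the slices in a group. The stack is assumed to be ordered from base to apex, and the function names and the `n_edge` parameter (the number of slices taken from each end) are hypothetical.

```python
def split_regions(num_slices, n_edge):
    """Slice indices for base, central and apex groups (stack ordered base to apex)."""
    n_edge = min(n_edge, num_slices // 2)  # keep the two end groups disjoint
    indices = list(range(num_slices))
    return {
        "base": indices[:n_edge],
        "central": indices[n_edge:num_slices - n_edge],
        "apex": indices[num_slices - n_edge:],
    }


def pooled_dice(pred_masks, truth_masks):
    """Dice index computed over all pixels of all given slices at once."""
    inter = total = 0
    for pred, truth in zip(pred_masks, truth_masks):
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                inter += p * t
                total += p + t
    return 2.0 * inter / total if total else 1.0


if __name__ == "__main__":
    print(split_regions(num_slices=9, n_edge=2))
```

Pooling pixels before computing the index weights large slices more heavily than averaging per-slice Dice values would, which matters near the apex where the LV region is small.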
In this paper we have investigated whether a single neural network architecture, trained end-to-end, can deliver a fully-automated and accurate segmentation of the left ventricle using a stack of MR short-axis images. The proposed architecture, RFCN, learns image features that are important for the localisation of the LV in a sequential manner, going from the base to the apex of the heart, through a recurrent modification of fully convolutional networks.
Experimental findings obtained from two independent applications demonstrate that propagating information from adjacent slices can help extract improved context information with positive effect on the resulting segmentation quality. The hypothetical value of the large inter-slice correlation has been further tested by introducing a recurrent version of deep belief networks, and verified with our results showing that RDBNs generally outperform DBNs on the segmentation task, assuming the LV has already been localised. As expected, notable improvements can be seen in the delineation of cardiac contours around the apex, which are notoriously more difficult to identify.
One surprising finding was that the performance of RFCN in apical slices was better for the MICCAI cohort than for the PRETERM cohort (Dice index in the most apical slice, see Table 2), when one could expect the opposite: a regular and homogeneous cohort, PRETERM, should lead to better performance when leveraging the inter-slice spatial dependence. This aspect warrants further investigation.
Compared to other models, RFCN has the advantage of carrying out both LV detection and segmentation in a single architecture, with clear computational benefits and the potential for real-time application. In future work, we are planning to investigate alternative operations that can capture inter-slice correlations, such as 3D convolutions, and to further extend RFCN by incorporating a bi-directional mechanism for the inclusion of an inverse path (from the apex to the base of the heart) as well as a temporal extension to handle all cardiac phases at once.
The authors would like to thank Paul Leeson and Adam Lewandowski from Oxford University for their assistance with the PRETERM dataset.
[1] Avendi, M.R., Kheradvar, A., Jafarkhani, H.: A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis 30, 108–119 (May 2016)
[2] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 (2014)
[3] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555 (2014)
[4] Georgescu, B., Zhou, X.S., Comaniciu, D., Gupta, A.: Database-guided segmentation of anatomical structures with complex appearance. In: CVPR. vol. 2, pp. 429–436 (2005)
[5] Hinton, G.E., Salakhutdinov, R.R.: Reducing the Dimensionality of Data with Neural Networks. Science 313(5786), 504–507 (2006)
[6] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580 [cs] (2012)
[7] Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997)
[8] Hu, H., Liu, H., Gao, Z., Huang, L.: Hybrid segmentation of left ventricle in cardiac MRI using gaussian-mixture model and region restricted dynamic programming. Magnetic Resonance Imaging 31(4), 575–584 (2013)
[9] Huang, R., Pavlovic, V., Metaxas, D.N.: A graphical model framework for coupling MRFs and deformable models. vol. 2, pp. 739–746 (Jun 2004)
[10] Huang, S., Liu, J., Lee, L.C., Venkatesh, S.K., Teo, L.L.S., Au, C., Nowinski, W.L.: An image-based comprehensive approach for automatic segmentation of left ventricle from cardiac short axis cine MR images. Journal of Digital Imaging 24(4), 598–608 (2011)
[11] Ioffe, S., Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs] (Feb 2015)
[12] Jolly, M.: Fully Automatic Left Ventricle Segmentation in Cardiac Cine MR Images Using Registration and Minimum Surfaces. MIDAS Journal 49 (2009)
[13] Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journal of Computer Vision 1(4), 321–331 (1988)
[14] Lewandowski, A.J., Augustine, D., Lamata, P., Davis, E.F., Lazdam, M., Francis, J., McCormick, K., Wilkinson, A.R., Singhal, A., Lucas, A., Smith, N.P., Neubauer, S., Leeson, P.: Preterm heart in adult life: cardiovascular magnetic resonance reveals distinct differences in left ventricular mass, geometry, and function. Circulation 127(2), 197–206 (2013)
[15] Li, C., Xu, C., Gui, C., Fox, M.: Distance Regularized Level Set Evolution and Its Application to Image Segmentation. IEEE Transactions on Image Processing 19(12), 3243–3254 (2010)
[16] Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation. In: CVPR (2015)
[17] Ngo, T.A., Carneiro, G.: Fully Automated Non-rigid Segmentation with Distance Regularized Level Set Evolution Initialized and Constrained by Deep-Structured Inference. In: CVPR. pp. 3118–3125 (2014)
[18] Ngo, T.A., Carneiro, G.: Left ventricle segmentation from cardiac MRI combining level set methods with deep belief networks. In: ICIP. pp. 695–699 (2013)
[19] Ngo, T.A., Lu, Z., Carneiro, G.: Combining deep learning and level set for the automated segmentation of the left ventricle of the heart from cardiac cine magnetic resonance. Medical Image Analysis 35, 159–171 (2017)
[20] Petitjean, C., Dacher, J.N.: A review of segmentation methods in short axis cardiac MR images. Medical Image Analysis 15(2), 169–184 (2011)
[21] Radau, P., Lu, Y., Connelly, K., Paul, G., Dick, A.J., Wright, G.A.: Evaluation Framework for Algorithms Segmenting Short Axis Cardiac MRI. The MIDAS Journal – Cardiac MR Left Ventricle Segmentation Challenge (2009)
[22] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: MICCAI (2015)
[23] Sutskever, I., Hinton, G.E., Taylor, G.W.: The Recurrent Temporal Restricted Boltzmann Machine. In: NIPS, pp. 1601–1608 (2009)
[24] Tieleman, T., Hinton, G.: Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning 4 (2012)
[25] Valipour, S., Siam, M., Jagersand, M., Ray, N.: Recurrent Fully Convolutional Networks for Video Segmentation. arXiv:1606.00487 [cs] (2016)