Cardiovascular disease (CV) remains the leading cause of death worldwide. Cardiac Magnetic Resonance (CMR) is frequently used for the diagnosis of various cardiac diseases. Segmentation of the Left Ventricle (LV) from CMR images is a standard procedure for the determination of cardiac parameters, but it is time-consuming and requires relevant experience.
The segmentation of the heart requires the identification of two anatomical regions: the inner part, called the endocardium, and the outer part, the epicardium. Identifying and segmenting those two regions in CMR images presents different levels of difficulty: while the endocardium has sufficient contrast, the epicardial surface presents intensity profiles with little contrast.
Notwithstanding, a clear delineation of the epicardial contours is a challenging task due to non-homogeneity in the blood flow. Moreover, the strong unevenness of the inner wall of the heart chambers, caused by the sporadic presence of papillary muscles, does not allow a clear delimitation of the endocardial wall. The level of difficulty also depends on the particular ventricular section: the apical and basal sections are much more difficult to segment because the image resolution is lower and does not allow the detection of the ventricular structure.
Fully automatic segmentation algorithms have been proposed in the last decade to speed up segmentation, but a robust and reliable solution for clinical practice is still missing. Existing approaches for fully automated LV segmentation can be divided into three groups: image-based methods, deformable models, and pixel-based classification methods. Image-based methods find the endocardium border using a threshold on gray-level intensities or Dynamic Programming (DP). Level-Set (LS) segmentation is the most successful choice among the methods using deformable models.
Finally, pixel classification methods, based on supervised Deep Neural Networks [10, 11, 12, 13], have become relatively successful for both object detection and segmentation. Proposed architectures include Fully Convolutional Neural Networks (FCNN) and Stacked Autoencoders (SA) [10].
Occasionally, a post-processing method is also applied after the initial segmentation to obtain more robust and accurate contours. In some cases, a single architecture trained end-to-end has been shown to achieve satisfactory performance without further pre- or post-processing, e.g. by modeling the spatial dependence among short-axis sections. There are also architectures for the segmentation of three-dimensional images, which have been shown to improve on the performance of 2D approaches.
Optical Flow (OP), i.e. the pattern of apparent motion of objects, has also been used in the segmentation context to exploit the regularity of the cardiac motion and to constrain the degree of allowed LV movement between adjacent image frames. Indeed, OP equations have been combined with the LS equations to enforce visual constancy.
In this work, we exploit a complementary source of information, the coherence across consecutive frames, and propose a Temporal Fully-Convolutional Neural Network (T-FCNN) architecture with Semantic Flow post-processing. The performance of the proposed architecture is compared against an established FCNN model, which treats each cardiac phase independently and achieves good segmentation performance. An alternative post-processing choice, the use of Conditional Random Fields (CRFs), is also investigated.
II TWINS-UK dataset
The TWINS-UK is a voluntary registry that includes >12,000 twins . For this study, 68 consecutive female subjects (mean age years) were recruited from the TWINS-UK cohort.
The CMR scans were performed on a 1.5-T clinical scanner (Achieva, Philips Healthcare, Best, The Netherlands). Each dataset included 12 to 14 equidistant and contiguous short-axis CINE from the atrioventricular (AV) ring to the apex, completely covering both ventricles (slice thickness 8 mm; no gap mm; field of view was mm and matrix size ).
ECG-gated steady-state free-precession (SSFP) 2D CINE sequences were acquired during end-expiratory breath-holds. Images were acquired with 30 phases per cardiac cycle, corresponding to a temporal resolution of 25-35 milliseconds at a heart rate of 60-80 beats per minute.
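The relation between heart rate, number of phases and temporal resolution quoted above can be checked with a short calculation. The sketch below assumes that the phases are evenly spaced over one cardiac cycle; the function name is illustrative and not part of the acquisition protocol.

```python
def temporal_resolution_ms(heart_rate_bpm: float, phases_per_cycle: int) -> float:
    """Duration of one cardiac phase in milliseconds.

    Assumes the phases are evenly distributed over a single cardiac cycle.
    """
    cycle_ms = 60_000.0 / heart_rate_bpm  # length of one heartbeat in ms
    return cycle_ms / phases_per_cycle


for hr in (60, 80):
    print(f"{hr} bpm -> {temporal_resolution_ms(hr, 30):.1f} ms per phase")
# 60 bpm -> 33.3 ms per phase
# 80 bpm -> 25.0 ms per phase
```

Both values fall within the 25-35 ms range reported for this protocol.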
The dataset was randomly divided into training, validation and testing sets of sizes , and , respectively. The two members of each twin pair were always allocated to the same subset, so that genetic similarities between twins could not affect our results.
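A pair-aware random split of this kind can be sketched as follows. This is a minimal illustration and not the code used in the study: the function name, the subject identifiers and the split fractions are hypothetical placeholders, since the actual subset sizes are not reproduced here.

```python
import random
from collections import defaultdict


def split_by_twin_pair(subject_to_pair, fractions=(0.7, 0.15, 0.15), seed=0):
    """Randomly split subjects into train/val/test, keeping twin pairs together.

    subject_to_pair maps a subject id to the id of its twin pair.
    fractions are placeholder proportions, applied to pairs rather than subjects.
    """
    pairs = defaultdict(list)
    for subject, pair_id in subject_to_pair.items():
        pairs[pair_id].append(subject)

    pair_ids = sorted(pairs)                   # deterministic starting order
    random.Random(seed).shuffle(pair_ids)      # reproducible shuffle of pairs

    n_train = round(fractions[0] * len(pair_ids))
    n_val = round(fractions[1] * len(pair_ids))
    groups = {
        "train": pair_ids[:n_train],
        "val": pair_ids[n_train:n_train + n_val],
        "test": pair_ids[n_train + n_val:],
    }
    # Expand each group of pairs back into its subjects.
    return {name: sorted(s for pid in ids for s in pairs[pid])
            for name, ids in groups.items()}


# Hypothetical cohort: 68 subjects forming 34 twin pairs.
cohort = {f"S{i:02d}": i // 2 for i in range(68)}
splits = split_by_twin_pair(cohort)
print({name: len(members) for name, members in splits.items()})
```

Shuffling pair identifiers instead of subjects is what guarantees that no pair is separated across subsets.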
III Proposed analysis work-flow
The work-flow is divided into three stages: LV position detection, LV segmentation, and LV contour refinement. Our input consists of the entire temporal sequence of images obtained in all cardiac phases, while the label output is the corresponding sequence of binary masks, one per time point. An index identifies the specific label pixel within each temporal image phase. Each input image was downsized to . The output masks have the same size after a padding operation of pixels.
III-A Single-frame LV position detection
For the automated detection of the LV in each time frame, we used the OverFeat algorithm. To train the object detection layers in OverFeat, we pre-trained the GoogLeNet architecture on the ImageNet database and later fine-tuned it on our LV images in a sliding-window fashion. The bounding box coordinates are predicted by regression layers that minimise an L2 loss.
III-B Sequence-based LV segmentation
Upon detecting the bounding box containing the heart, the cardiac contours are inferred with an architecture that extends U-Net, originally proposed for the segmentation of biomedical images. More specifically, our solution is an improvement of a Fully Convolutional Neural Network (FCNN), which becomes our baseline for comparisons.
A standard U-Net takes an individual frame as input and estimates the corresponding binary mask as output. An encoding (descending) path is composed of a sequence of hidden layers that learn a representation of the input image. This is followed by another sequence of hidden layers forming a decoding (ascending) path, through which all the feature maps are gradually restored to the original image size and used to infer the final binary masks. As in U-Net, the hidden layers in both the encoding and the decoding paths consist of repeated convolutional (or upsampling) filters followed by pooling and ReLU mappings.
One of the key features of this architecture is the use of skip connections between convolution and deconvolution layers, which fuse global and local information. Each skip connection concatenates feature maps from the encoding path with those of the decoding path (i.e. the feature maps extracted from each convolutional encoding block are concatenated with those of the corresponding deconvolutional decoding block).
To segment the entire cardiac motion, we modify the U-Net architecture by adding a recurrent layer immediately after the descending path. The purpose of this layer is to leverage the information flowing from preceding image frames and to enforce temporal coherence (i.e. to exploit the redundancy between frames over time). The feature maps learned at the encoding stage are used as inputs to a recurrent element implemented as a Convolutional Gated Recurrent Unit (Conv-GRU). The resulting architecture, a Temporal Fully-Convolutional Neural Network (T-FCNN), is illustrated in Fig. 1.
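To make the role of the recurrent element concrete, the following is a minimal sketch of a Conv-GRU update, not the implementation used in this work. It assumes a single-channel feature map, 3x3 kernels, no bias terms, and the update convention h_t = (1 - z) h_{t-1} + z h~_t; all function and parameter names are illustrative. A real T-FCNN would use multi-channel tensors in a deep learning framework.

```python
import math


def conv2d_same(image, kernel):
    """3x3 zero-padded convolution (cross-correlation) on a 2D list."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * image[rr][cc]
            out[r][c] = acc
    return out


def combine(f, *maps):
    """Apply f element-wise across equally sized 2D maps."""
    return [[f(*vals) for vals in zip(*rows)] for rows in zip(*maps)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def conv_gru_step(x, h, p):
    """One Conv-GRU update for feature map x and previous hidden state h."""
    # Update gate z and reset gate r, both computed with convolutions.
    z = combine(lambda a, b: sigmoid(a + b),
                conv2d_same(x, p["Wz"]), conv2d_same(h, p["Uz"]))
    r = combine(lambda a, b: sigmoid(a + b),
                conv2d_same(x, p["Wr"]), conv2d_same(h, p["Ur"]))
    # Candidate state uses the reset-gated previous state.
    rh = combine(lambda a, b: a * b, r, h)
    cand = combine(lambda a, b: math.tanh(a + b),
                   conv2d_same(x, p["Wh"]), conv2d_same(rh, p["Uh"]))
    # Blend previous state and candidate according to the update gate.
    return combine(lambda zi, hi, ci: (1.0 - zi) * hi + zi * ci, z, h, cand)


def run_sequence(frames, params):
    """Propagate the hidden state through a sequence of encoded frames."""
    h = [[0.0] * len(frames[0][0]) for _ in frames[0]]
    states = []
    for x in frames:
        h = conv_gru_step(x, h, params)
        states.append(h)
    return states
```

Because the gates are computed with convolutions instead of dense matrix products, the hidden state keeps the spatial layout of the encoded feature maps and can be passed on to the decoding path.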
The final component of the segmentation work-flow is a post-processing algorithm that has the potential to further improve the binary masks predicted by the T-FCNN. In this study, we compare two methods that have been particularly successful in semantic segmentation tasks, i.e. fully connected CRFs with Gaussian edge potentials and Semantic-Flow (SF). The fully connected CRF minimises an energy function:
$$E(y^t) = \sum_i \psi_u(y_i^t) + \sum_{i<j} \psi_p(y_i^t, y_j^t)$$
where $y^t$ is the segmentation mask from frame $t$, $i$ is the index of each label pixel, $\psi_u$ is the unary potential, defined as $\psi_u(y_i^t) = -\log P(y_i^t)$, and $\psi_p$ is the pairwise potential, defined by:
$$\psi_p(y_i^t, y_j^t) = \mu(y_i^t, y_j^t) \sum_{m=1}^{2} w^{(m)} k^{(m)}(f_i, f_j)$$
The function $\mu(y_i^t, y_j^t)$ equals one when $y_i^t \neq y_j^t$, zero otherwise, and each function $k^{(m)}$ is a Gaussian kernel, evaluated using features $f_i$ and $f_j$ corresponding to pixels $i$ and $j$ respectively. Notably, the first kernel is driven by both pixel positions ($p$) and intensities ($I$), whereas the second kernel only uses pixel positions. Kernel coefficients were kept fixed, and the energy function was minimised using the L-BFGS algorithm for nonlinear optimisation on multi-threaded CPUs.
The second post-processing algorithm uses a Semantic-Flow (SF) approach, which is an alternative way to exploit the temporal coherence in a sequence. Our implementation follows the original formulation of SF, and the reader is referred to that paper for a detailed explanation of the concepts that are summarised next. Given a CINE MRI sequence of 2D frames, our aim is to simultaneously estimate the flow vector field that maps every pixel between consecutive 2D images. The task is divided into two regions, corresponding to the LV mask and the background, and the vector fields are parametrised with a set of parameters. The input is the initial LV segmentation that comes from the T-FCNN. We then wish to minimise an energy function that consists of five terms: data, motion, time, space and coupling.
The data term measures the similarity between the gray-scale intensities of adjacent frames; the motion term enforces some regularity of the vector field in pixels that belong to the same region or that are close to each other; the time term constrains the regularity of pixels belonging to the LV region or the background along the sequence; the space term enforces the connectivity of pixels within the same region; and the coupling term emphasises the affinity between the background segmentation and the LV segmentation. The weights that balance the contribution of the energy terms are constants, and they were kept fixed in our study.
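For the fully connected CRF, the energy of a candidate binary mask can be evaluated by brute force on a small image, as sketched below. This is a hedged illustration only: it assumes the standard two-kernel form of fully connected CRFs with Gaussian edge potentials (an appearance kernel on position and intensity plus a smoothness kernel on position), a unary term equal to the negative log-probability of the chosen label, and a penalty applied only to pixel pairs with different labels. The weights and bandwidths are placeholder values, not those used in this study, and the sketch evaluates the energy without performing the minimisation.

```python
import math


def crf_energy(labels, probs_fg, intensities,
               w_app=1.0, w_smooth=1.0,
               theta_pos=3.0, theta_int=0.1, theta_smooth=1.0):
    """Energy of a binary labelling under a fully connected CRF.

    labels:      2D list of 0/1 labels (1 = LV).
    probs_fg:    2D list with the predicted probability of label 1.
    intensities: 2D list of gray-level intensities.
    All weights and bandwidths are illustrative placeholders.
    """
    h, w = len(labels), len(labels[0])
    pixels = [(r, c) for r in range(h) for c in range(w)]
    eps = 1e-12
    energy = 0.0

    # Unary term: negative log-probability of the assigned label.
    for r, c in pixels:
        p = probs_fg[r][c] if labels[r][c] == 1 else 1.0 - probs_fg[r][c]
        energy += -math.log(max(p, eps))

    # Pairwise term over all pixel pairs i < j with different labels.
    for a in range(len(pixels)):
        ra, ca = pixels[a]
        for b in range(a + 1, len(pixels)):
            rb, cb = pixels[b]
            if labels[ra][ca] == labels[rb][cb]:
                continue  # no penalty when the labels agree
            d2 = (ra - rb) ** 2 + (ca - cb) ** 2
            di2 = (intensities[ra][ca] - intensities[rb][cb]) ** 2
            k_app = math.exp(-d2 / (2 * theta_pos ** 2)
                             - di2 / (2 * theta_int ** 2))
            k_smooth = math.exp(-d2 / (2 * theta_smooth ** 2))
            energy += w_app * k_app + w_smooth * k_smooth
    return energy
```

With these kernels, an isolated pixel labelled differently from nearby pixels of similar intensity raises the energy, which is the mechanism by which the CRF removes small spurious regions from a mask.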
IV Experimental Results
The performance of the LV position detection was assessed with the Intersection over Union (IoU), reaching a score of 98%. The accuracy of the final segmentation was measured using three different metrics: the Dice Index (DI), the Average Perpendicular Distance (APD) and the Conformity (C) index.
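Two of these quantities have simple closed forms and can be sketched in a few lines. The code below is illustrative, with hypothetical function names: boxes are assumed to be given as (x_min, y_min, x_max, y_max), masks as 2D lists of 0/1 values, and the conformity index is assumed to follow the common definition C = (3 DI - 2) / DI, which may differ from the exact variant used in this study. The APD is omitted because it requires contour extraction.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def dice_index(pred, truth):
    """Dice index between two binary masks of equal size."""
    inter = sum(p * t for rp, rt in zip(pred, truth) for p, t in zip(rp, rt))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 2.0 * inter / total if total > 0 else 1.0  # two empty masks agree


def conformity(dice):
    """Conformity index derived from the Dice index (assumed definition)."""
    return (3.0 * dice - 2.0) / dice if dice > 0 else float("-inf")


pred = [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
di = dice_index(pred, truth)
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)), di, conformity(di))
```

Under this definition the conformity index becomes negative whenever the Dice index drops below 2/3, so it penalises poor segmentations more strongly than the Dice index does.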
A set of architectures, ranging from the baseline FCNN to the proposed T-FCNN with CRFs, was implemented and analysed. The parameters of both T-FCNN and FCNN were optimised with the Adadelta optimizer, and both architectures were trained end-to-end with a spatial cross-entropy criterion and a learning rate of .
Results are reported in Table I. T-FCNN improves on FCNN, reducing the APD metric by 30.5% (from to mm), and the addition of CRFs reduces this error metric by a further 12% (from to mm). The CRF improved the performance in all cases, whereas SF only improved the APD when added to the FCNN. Some illustrative examples of the final segmentation result are provided in Fig. 2.
| Algorithms | DICE (%) | APD (mm) | C (%) |
V Discussion and Conclusions
Exploitation of the temporal coherence across frames is a useful resource for the fully automated, robust and accurate segmentation of CINE MRI sequences, as required for clinical practice.
A CINE MRI study has a large level of coherence both in space and time. Driven by clinical interest (i.e. the generation of metrics such as the blood pool volume at a given time) and by the lighter burden of generating the ground truth (i.e. only 10-12 slices need to be segmented at a given time), most of the algorithms proposed to date exploit the coherence in space.
The use of a recurrent unit in space (i.e. across slices) reduced the APD by 4.2% and 13% in the MICCAI and PRETERM cohorts, respectively, as previously reported, whereas the use of a recurrent unit in time (i.e. across frames) improved the APD by 30% in our TWINS-UK cohort (see Table I). Temporal coherence thus seems to be a more useful resource, probably because the correlation between adjacent frames in time is larger than that between adjacent slices in a CINE MRI study. This has been achieved by a simple extension of the U-Net architecture through a recurrent neural network component (T-FCNN).
In the search for the best strategy to exploit temporal coherence, our results suggest that T-FCNN is a better solution than FCNN+SF. We cannot claim that we fully exploited the potential of both strategies (e.g. exploring the whole parametric space of SF is a tedious exercise), but our results with a reasonable set of parameter and architecture combinations consistently suggested that SF improved the LV segmentation performance only marginally, if at all.
It is important to note that the clinical protocol for the CMR segmentation in our TWINS-UK cohort included the papillary muscles as part of the myocardium, in contrast with the majority of previous studies [13, 6]. This is a more challenging task for both the human observer and the algorithm, since the smoothness constraints of contouring a circular shape cannot be used, and it is the reason why previous APD values were smaller (i.e. around 2 mm).
Nevertheless, the use of a recurrent unit is not a perfect solution, since it is limited by the vanishing gradient problem, which reduces the effectiveness of backpropagation through time. This limits the number of cardiac frames that can be input to the architecture. In practice, this should not be a major problem, since acquisitions with more than 30 frames, as used in this study, are not common in CMR.
The translation of this concept to echocardiography will nevertheless bring some challenges, since sequences of up to thousands of frames can be acquired with ultra-fast acquisition protocols. An easy first solution would be to impose an upper bound on the gradient (i.e. gradient clipping) or to add a regularisation term able to increase or decrease the gradient magnitude. Finally, 3D convolution kernels could be useful to progressively decrease the length of the input sequences, avoiding excessively long temporal or spatial sequences.
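The upper bound on the gradient mentioned above can be sketched as clipping by the global norm. This is one common variant and only an assumption about how such a bound might be applied; the function name and the nested-list representation of the gradients are illustrative. Clipping bounds overly large gradients and does not by itself restore gradients that have vanished, which is why a complementary regularisation term is also mentioned.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so that their joint L2 norm
    does not exceed max_norm. Returns the clipped gradients and the
    norm measured before clipping."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm or total == 0.0:
        return [list(vec) for vec in grads], total  # already within the bound
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total


clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [0.0, 4.0]], max_norm=1.0)
print(norm_before, clipped)
```

Rescaling all gradients by the same factor preserves the direction of the update while limiting its magnitude.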
The main practical bottleneck is the heavier burden of segmentation needed to transfer the learning to another study or experimental setting. Instead of learning from single frames, the proposed architecture needs to learn from complete sequences. The availability of semi-automatic segmentation solutions in commercial products should make this task affordable. In any case, our TWINS-UK dataset with its manual ground truth segmentation is available to the community upon request, as a reference for future solutions that exploit the temporal coherence in CMR sequences.
-  Kreatsoulas C, Anand SS. The impact of social determinants on cardiovascular disease. The Canadian Journal of Cardiology, 26(Suppl C):8C-13C, 2010.
-  Salerno M, Kramer CM. Advances in Cardiovascular MRI for Diagnostics: Applications in Coronary Artery Disease and Cardiomyopathies. Expert opinion on medical diagnostics, vol 3(6) pp 673-687, 2009.
-  Valliappan Raman and Patrick Then and Annuar Rapaee, A Brief Study on Automated Non Contrast Cardiac MRI Segmentation by Machine Vision Techniques, IACSIT International Journal of Engineering and Technology, vol 4, num 5, 2012.
-  Ebeling C. Barbier and L. Johansson and L. Lind and H. Ahlström and T. Bjerner, The Exactness of Left Ventricular Segmentation in Cine Magnetic Resonance Imaging and Its Impact on Systolic Function Values, Acta Radiologica, vol 48, num 3, pp. 285-291, 2007.
-  C. Petitjean and J.-N. Dacher, A review of segmentation methods in short axis cardiac MR images, Medical Image Analysis, vol 15, num 2, pp. 169-184, 2011.
-  Olivier Bernard et al., Deep Learning Techniques for Automatic MRI Cardiac Multi-structures Segmentation and Diagnosis: Is the Problem Solved? IEEE Transactions on Medical Imaging, p 1, DOI 10.1109/TMI.2018.2837502, 2018.
-  A. Goshtasby and D. Turner, Segmentation of cardiac cine MR images for extraction of right and left ventricular chambers, IEEE Transactions on Medical Imaging, vol. 14, no. 1, pp. 56–64, 1995.
-  J.Y. Yeh, J.C. Fu, C.C. Wu, H.M. Lin, J.W. Chai, Myocardial border detection by branch-and-bound dynamic programming in magnetic resonance images, Computer Methods and Programs in Biomedicine, vol 79, pp 19-29, 2005.
-  T. F. Chan and L. A. Vese, Active contours without edges, IEEE Transactions on Image Processing, vol. 10, no. 2, pp. 266-277, 2001.
-  Ngo, T.A., Carneiro, G., Left ventricle segmentation from cardiac MRI combining level set methods with deep belief networks. In: ICIP. pp. 695–699, 2013.
-  Avendi, M.R., Kheradvar, A., Jafarkhani, H., A combined deep-learning and deformable-model approach to fully automatic segmentation of the left ventricle in cardiac MRI. Medical Image Analysis 30, 108–119, 2016.
-  Phi Vu Tran, A Fully Convolutional Neural Network for Cardiac Segmentation in Short-Axis MRI, CoRR, vol. abs/1604.00494, 2016.
-  Rudra P. K. Poudel, Pablo Lamata, Giovanni Montana, Recurrent Fully Convolutional Neural Networks for Multi-slice MRI Cardiac Segmentation, CoRR, vol. abs/1608.03974, 2016.
-  Moayyeri A, Hammond CJ, Valdes AM, Spector TD. Cohort Profile: TwinsUK and healthy ageing twin study. Int J Epidemiol. 2013 Feb;42(1):76-85.
-  Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, Understanding the exploding gradient problem, CoRR, abs/1211.5063, 2012.
-  Jianxu Chen and Lin Yang and Yizhe Zhang and Mark S. Alber and Danny Z. Chen Combining Fully Convolutional and Recurrent Neural Networks for 3D Biomedical Image Segmentation, CoRR, vol. abs/1609.01006, 2016.
-  Paragios, Nikos. Hybrid optical flow and segmentation technique for LV motion detection. IEEE Transactions on Medical Imaging 22(6): 773-776, 2003.
-  Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, Yann LeCun, OverFeat: Integrated Recognition, Localization and Detection using Convolutional Network, CoRR, vol. abs/1312.6229, 2013
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going Deeper with Convolutions, CoRR, vol. abs/1409.4842, 2014.
-  Philipp Krähenbühl and Koltun, Vladlen, Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, Advances in Neural Information Processing Systems 24, J. Shawe-Taylor and R. S. Zemel and P. L. Bartlett and F. Pereira and K. Q. Weinberger, pp. 109-117, 2011.
-  Laura Sevilla-Lara and Deqing Sun and Varun Jampani and Michael J. Black, Optical Flow with Semantic Segmentation and Localized Layers, CoRR, abs/1603.03911, 2016.
-  Ronneberger, Olaf and Fischer, Philipp and Brox, Thomas, U-Net: Convolutional Networks for Biomedical Image Segmentation, Medical Image Computing and Computer-Assisted Intervention Proceedings, Part III, pp. 234-241, 2015.
-  Long, J., Shelhamer, E., Darrell, T.: Fully Convolutional Networks for Semantic Segmentation, CVPR, 2015.
-  Junyoung Chung and Çaglar Gülçehre and KyungHyun Cho and Yoshua Bengio, Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling, CoRR, 2014.