Automated segmentation on the entire cardiac cycle using a deep learning work-flow

08/31/2018 ∙ by Nicolo' Savioli, et al. ∙ King's College London ∙ University of Warwick

The segmentation of the left ventricle (LV) from CINE MRI images is essential to infer important clinical parameters. Typically, machine learning algorithms for automated LV segmentation use annotated contours from only two cardiac phases, diastole and systole. In this work, we present an analysis work-flow for fully-automated LV segmentation that learns from images acquired throughout the cardiac cycle. The work-flow consists of three components. First, for each image in the sequence, we perform an automated localization and subsequent cropping of the bounding box containing the cardiac silhouette. Second, we identify the LV contours using a Temporal Fully Convolutional Neural Network (T-FCNN), which extends Fully Convolutional Neural Networks (FCNN) through a recurrent mechanism enforcing temporal coherence across consecutive frames. Finally, we further refine the boundaries using one of two components: fully-connected Conditional Random Fields (CRFs) with Gaussian edge potentials, or Semantic Flow. Our initial experiments suggest that significant improvement in performance can potentially be achieved by using a recurrent neural network component that explicitly learns cardiac motion patterns whilst performing LV segmentation.




I Introduction

Cardiovascular disease (CVD) remains the leading cause of death worldwide [1]. Cardiac Magnetic Resonance (CMR) is frequently used for the diagnosis of various cardiac diseases [2]. Segmentation of the Left Ventricle (LV) from CMR images is a standard procedure for determining cardiac parameters, but it is time-consuming and requires relevant experience.

The segmentation of the heart requires the identification of two anatomical regions: the inner part called the endocardium, and the outer part, the epicardium. Identifying and segmenting those two regions in CMR images presents different levels of difficulty: while the endocardium has sufficient contrast, the epicardial surface presents profiles of intensity with little contrast [3].

Nevertheless, a clear delineation of the epicardial contours remains a challenging task due to non-homogeneity in the blood flow. Moreover, strong unevenness of the inner wall of the heart chambers, caused by the sporadic presence of papillary muscles, prevents a clear delimitation of the endocardial wall [4]. The level of difficulty also depends on the particular ventricular section: the apical and basal sections are much more difficult to segment because the image resolution is lower and does not allow the detection of the ventricular structure [5].

Fully automatic segmentation algorithms have been proposed in the last decade to speed up segmentation, but a robust and reliable solution for clinical practice is still missing [6]. Existing approaches for fully-automated LV segmentation can be divided into three groups: image-based methods, deformable models, and pixel-based classification methods. Image-based methods find the endocardium border using a threshold on gray level intensities [7] or Dynamic Programming (DP) [8]. Level-Set (LS) segmentation, such as [9], is the most successful choice among the methods using deformable models.

Finally, pixel classification methods, based on supervised Deep Neural Networks [10, 11, 12, 13], have become relatively successful for both object detection and segmentation. Proposed architectures include Fully Convolutional Neural Networks (FCNN) [12], Stacked Autoencoders (SA) and Deep Belief Networks.


Occasionally, a post-processing method is also applied, following the initial segmentation, to obtain more robust and accurate contours. In some cases, a single architecture trained end-to-end has been shown to achieve satisfactory performance without the need for further pre- and post-processing, e.g. by modeling the spatial dependence amongst short-axis sections [13]. There are also architectures for the segmentation of three-dimensional images, such as [16], which have been shown to boost the performance of 2D approaches [6].

Optical Flow (OP), i.e. the pattern of apparent motion of objects, has also been used in the segmentation context to exploit the regularity of the cardiac motion and constrain the degree of allowed LV movement between adjacent image frames. Indeed, OP equations have been combined with the LS equations to enforce visual constancy [17].

In this work, we exploit a complementary source of information, the coherence across consecutive frames, and propose a Temporal Fully Convolutional Neural Network (T-FCNN) architecture with Semantic Flow post-processing. The performance of the proposed architecture is compared against an established FCNN model, which treats each cardiac phase independently and achieves good segmentation performance [12]. An alternative post-processing choice, the use of Conditional Random Fields (CRFs) [20], is also investigated.

II TWINS-UK dataset

The TWINS-UK is a voluntary registry that includes >12,000 twins [14]. For this study, 68 consecutive female subjects (mean age years) were recruited from the TWINS-UK cohort.

The CMR scans were performed on a 1.5-T clinical scanner (Achieva, Philips Healthcare, Best, The Netherlands). Each dataset included 12 to 14 equidistant and contiguous short-axis CINE from the atrioventricular (AV) ring to the apex, completely covering both ventricles (slice thickness 8 mm; no gap mm; field of view was mm and matrix size ).

ECG-gated steady-state free-precession (SSFP) 2D CINE images were acquired during end-expiratory breath-holds, with 30 phases per cardiac cycle, corresponding to a temporal resolution of 25-35 milliseconds at a heart rate of 60-80 beats per minute.

The dataset was randomly divided into training, validation and testing sets of sizes , and , respectively. Each pair of twins was allocated to a specific subset and not separated to avoid any genetic similarities affecting our results.
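The pair-wise allocation described above can be sketched as a group-aware split. This is only an illustration under stated assumptions: the subset fractions, the subject and pair identifiers, and the function name `split_by_pair` are hypothetical, since the actual subset sizes are not given in the text.

```python
import random

def split_by_pair(subject_to_pair, fractions=(0.6, 0.2, 0.2), seed=0):
    """Randomly assign whole twin pairs to train/val/test subsets.

    `fractions` is a hypothetical choice, not the split used in the study.
    """
    pairs = sorted(set(subject_to_pair.values()))
    random.Random(seed).shuffle(pairs)
    n_train = round(fractions[0] * len(pairs))
    n_val = round(fractions[1] * len(pairs))
    subset_of_pair = {}
    for k, pair in enumerate(pairs):
        if k < n_train:
            subset_of_pair[pair] = "train"
        elif k < n_train + n_val:
            subset_of_pair[pair] = "val"
        else:
            subset_of_pair[pair] = "test"
    # Every subject inherits the subset of its pair, so twins are never separated.
    return {subj: subset_of_pair[pair] for subj, pair in subject_to_pair.items()}

# Hypothetical registry: 68 subjects forming 34 pairs.
subjects = {f"subj{idx:02d}": f"pair{idx // 2:02d}" for idx in range(68)}
assignment = split_by_pair(subjects)
```

Splitting at the level of pairs rather than subjects is what keeps genetically similar hearts from appearing on both sides of the train/test boundary.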

III Proposed analysis work-flow

The work-flow is divided into three stages: LV position detection, LV segmentation, and LV contour refinement. Our input consists of the entire temporal sequence obtained in all cardiac phases, while the label output is the corresponding sequence of binary masks, one per time point. Each label pixel is identified by its pixel index and by the temporal image phase. Each input image was downsized to . The output masks have the same size after a padding operation of


III-A Single-frame LV position detection

For the automated detection of the LV in each time frame, we used the Over-Feat algorithm [18]. To train the object detection layers in Over-Feat, we pre-trained the GoogLeNet architecture [19] on the ImageNet database and later fine-tuned it on our LV images in a sliding-window fashion. The bounding box coordinates are predicted by regression layers that minimize an L2 loss.

III-B Sequence-based LV segmentation

Upon detecting the bounding box containing the heart, the cardiac contours are inferred with an architecture that extends U-Net [22], originally proposed for the segmentation of biomedical images.

A standard U-Net takes an individual frame as input and estimates the corresponding binary mask as output. An encoding (descending) path is composed of a sequence of hidden layers that are used to learn a representation of the input image. Each hidden layer consists of repeated convolution operations followed by pooling and a Rectified Linear Unit (ReLU).

More specifically, our solution is an improvement of a Fully Convolutional Neural Network (FCNN) [23], which becomes our baseline for comparisons. The encoding path is followed by another sequence of hidden layers forming a decoding (ascending) path, through which all the feature maps are gradually restored to the original image size and used to infer the final binary masks. As in U-Net, the hidden layers in both the encoding and the decoding paths consist of convolutional (or upsampling) filters followed by pooling and ReLU mappings.

One of the key features of this architecture is the use of skip connections between convolution and deconvolution layers, for the purpose of fusing global and local information. Each skip connection concatenates feature maps from the encoding path to the decoding path (i.e. it extracts the feature maps of a convolutional encoding block and concatenates them with those of the corresponding deconvolutional decoding block).

To segment the entire cardiac motion, we modify the U-Net architecture by adding a recurrent layer immediately after the descending path. The purpose of this layer is to leverage the information flowing from preceding image frames and enforce temporal coherence (i.e. exploit the redundancy over time). The feature maps learned at the encoding stage are used as inputs for a recurrent element implemented as a Convolutional Gated Recurrent Unit (Conv-GRU) [24]. The resulting architecture, a Temporal Fully-Convolutional Neural Network (T-FCNN), is illustrated in Fig. 1.
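To make the recurrent element more concrete, the following is a minimal sketch of one Conv-GRU update in pure Python. It is not the implementation used in this work: it assumes a single-channel feature map, 3×3 kernels with zero padding, no bias terms, and the common update convention h_t = (1 − z) · h_{t−1} + z · h̃_t. All names (`conv3x3`, `conv_gru_step`, `run_sequence`, the kernel keys) are hypothetical.

```python
import math

def conv3x3(img, kernel):
    """Zero-padded 3x3 cross-correlation of a 2D list with a 3x3 kernel."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr + 1][dc + 1] * img[rr][cc]
            out[r][c] = acc
    return out

def combine(fn, *maps):
    """Apply fn element-wise across equally sized 2D maps."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*maps)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv_gru_step(x, h_prev, p):
    """One Conv-GRU update: gates and candidate state are computed by convolutions."""
    z = combine(lambda a, b: sigmoid(a + b), conv3x3(x, p["Wz"]), conv3x3(h_prev, p["Uz"]))
    r = combine(lambda a, b: sigmoid(a + b), conv3x3(x, p["Wr"]), conv3x3(h_prev, p["Ur"]))
    gated = combine(lambda a, b: a * b, r, h_prev)  # reset gate applied to the past state
    cand = combine(lambda a, b: math.tanh(a + b), conv3x3(x, p["Wh"]), conv3x3(gated, p["Uh"]))
    return combine(lambda zz, hp, cd: (1.0 - zz) * hp + zz * cd, z, h_prev, cand)

def run_sequence(frames, p):
    """Propagate a zero-initialised hidden state through all frames of a sequence."""
    h = [[0.0] * len(frames[0][0]) for _ in frames[0]]
    states = []
    for x in frames:
        h = conv_gru_step(x, h, p)
        states.append(h)
    return states

def const_kernel(v):
    return [[v] * 3 for _ in range(3)]

# Toy example: three 4x4 "feature maps" and arbitrary (untrained) kernels.
params = {name: const_kernel(0.1) for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
frames = [[[float(t + r + c) for c in range(4)] for r in range(4)] for t in range(3)]
states = run_sequence(frames, params)
```

Because the hidden state keeps the spatial layout of the feature maps, information from preceding frames is carried forward pixel by pixel, which is the mechanism the text relies on for temporal coherence.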

The encoding path consists of four blocks of hidden layers, which are arranged as follows: two convolutional layers (with stride set to ), a ReLU layer, a batch normalization (BN) layer and, finally, a max pooling layer (with stride set to ).

Fig. 1: T-FCNN architecture. Blue blocks represent a convolutional layer followed by ReLU operations; orange blocks represent batch normalization operations; green blocks correspond to up-convolution operations; pink blocks correspond to max-pooling operations and purple blocks identify padding operations. Black arrows represent the input and output for each CINE MRI frame and its temporal segmentation. Gray arrows denote copy operations and the yellow arrow indicates the recurrent connection implemented through a Conv-GRU layer.

III-C Post-processing

The final component of the segmentation work-flow is a post-processing algorithm that has the potential to further improve upon the binary masks predicted by the T-FCNN. In this study, we compare two methods that have been particularly successful for semantic segmentation tasks, i.e. Fully Connected CRFs with Gaussian edge potentials [20] and Semantic-Flow (SF) [21]. The fully connected CRFs minimise an energy function:

$$E(x^t) = \sum_i \psi_u(x_i^t) + \sum_{i<j} \psi_p(x_i^t, x_j^t)$$

where $x^t$ is the segmentation mask from frame $t$, $i$ is the index of each label pixel, $\psi_u$ is the unary potential, defined as $\psi_u(x_i^t) = -\log P(x_i^t)$, and $\psi_p$ is the pairwise potential, defined by:

$$\psi_p(x_i^t, x_j^t) = \mu(x_i^t, x_j^t) \sum_{m=1}^{2} w_m \, k_m(f_i, f_j)$$

The function $\mu(x_i^t, x_j^t)$ equals one when $x_i^t \neq x_j^t$, zero otherwise, and each function $k_m$ is a Gaussian kernel, evaluated using features $f_i$ and $f_j$ corresponding to pixels $i$ and $j$ respectively. Notably, the first kernel is driven by both pixel positions ($p$) and color intensities ($I$), whereas the second kernel only uses pixel positions. Kernel coefficients were kept fixed, and the energy function was minimized using the L-BFGS algorithm for nonlinear optimization on multi-threaded CPUs.

The second post-processing algorithm uses a Semantic-Flow (SF) approach, which is an alternative way to exploit the temporal coherence in a sequence. Our implementation follows the formulation presented in [21], and the reader is referred to that paper for a detailed explanation of the concepts that are summarised next. Given a CINE MRI sequence of 2D frames, our aim is to simultaneously estimate the flow vector field that maps every pixel between consecutive 2D images. The task is divided into two regions, corresponding to the LV mask and the background, and the vector fields are parametrised with a set of parameters. The input is the initial LV segmentation that comes from the T-FCNN. We then wish to minimise an energy function that consists of five terms: data, motion, time, space and coupling.

The data term measures the similarity between the gray scale intensities of adjacent frames; the motion term enforces some regularity of the vector field in pixels that belong to the same region or that are close to each other; the time term constrains the regularity of pixels belonging to an LV region or to the background along the sequence; the space term enforces the connectivity of pixels within the same region; and the coupling term emphasizes the affinity between the background segmentation and the LV segmentation. The constants that weight the contribution of each energy term were kept fixed in our study.
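For reference, the two Gaussian kernels of the CRF pairwise potential described above can be written out explicitly. The block below is a sketch in the standard form used for fully connected CRFs with Gaussian edge potentials, as in [20]; the weights $w_1, w_2$ and bandwidths $\sigma_\alpha, \sigma_\beta, \sigma_\gamma$ are notation assumed here and do not appear in the text, and $p_i$ and $I_i$ denote the position and intensity of pixel $i$.

```latex
\psi_p(x_i, x_j) = \mu(x_i, x_j)
\left[
  w_1 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\alpha^2}
                   -\frac{\lVert I_i - I_j \rVert^2}{2\sigma_\beta^2} \right)
  + w_2 \exp\!\left( -\frac{\lVert p_i - p_j \rVert^2}{2\sigma_\gamma^2} \right)
\right]
```

The first (appearance) kernel penalises different labels for nearby pixels of similar intensity, while the second (smoothness) kernel depends on position only and removes small isolated regions.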

Fig. 2: a) Comparison of the segmentations obtained from FCNN (blue line) and T-FCNN (green line) with the clinical ground truth (red line). The left column shows LV segmentations of the top slices, while the right column shows LV segmentations at the apex. Both show that T-FCNN performs well in comparison with FCNN. In particular, FCNN tends to segment only high-intensity regions, as can be seen in the apical cases. b) Comparison of the segmentations obtained without and with post-processing. Semantic-Flow (Semantic flow column) tends to blunt the LV prediction, while the CRF (CRF column) remains fairly consistent with T-FCNN (no post-processing column).

IV Experimental Results

The performance of the detection of the position of the LV was assessed with the Intersection Over Union, reaching a score of 98%. The accuracy of the final segmentation was measured using three different metrics: the Dice Index (DI), the Average Perpendicular Distance (APD) and the Conformity (C) index [11].
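As an illustration of the overlap metrics named above, the sketch below computes the Intersection Over Union of two bounding boxes and the Dice Index of two binary masks. The conformity function assumes the commonly used definition C = (3·DI − 2) / DI, which is not spelled out in the text; the APD is omitted because it requires contour extraction. The box format, the toy masks and all function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def dice_index(pred, truth):
    """Dice index of two binary masks given as 2D lists of 0/1 values."""
    inter = sum(p & t for pr, tr in zip(pred, truth) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, truth))
    return 2.0 * inter / total if total else 1.0

def conformity(dice):
    """Conformity index, assuming the definition C = (3*DI - 2) / DI."""
    return (3.0 * dice - 2.0) / dice

# Toy masks: the prediction has one extra foreground pixel.
pred = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
truth = [[0, 1, 1], [0, 1, 0], [0, 0, 0]]
di = dice_index(pred, truth)
ci = conformity(di)
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

Under the assumed definition, the conformity index penalises errors more strongly than the Dice Index, so small differences in Dice translate into larger differences in C.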

A set of architectures, from the baseline FCNN to the proposed T-FCNN with CRFs, was implemented and analyzed. The parameters of both T-FCNN and FCNN were tuned with the Adadelta optimizer, and both architectures were trained end-to-end with a spatial cross-entropy criterion with a learning rate of .

Results are reported in [Table I]. T-FCNN improves on FCNN, reducing the APD metric by 30.5% (from 10.27 to 7.14 mm), and the addition of CRFs reduces this error metric by a further 12% (from 7.14 to 6.29 mm). The CRF improved the performance in all cases, but SF only improved the APD metric when added to the FCNN. Some illustrative examples of the final segmentation result are provided in [Fig. 2].

| Algorithms | DICE (%) | APD (mm) | C (%) |
|---|---|---|---|
| FCNN | 0.9745 (0.0163) | 10.2734 (12.7622) | 0.9472 (0.0352) |
| T-FCNN | 0.9803 (0.0263) | 7.1427 (11.0284) | 0.9583 (0.0592) |
| FCNN+CRFs | 0.9774 (0.0161) | 8.2084 (8.3545) | 0.9532 (0.0345) |
| T-FCNN+CRFs | 0.9815 (0.0245) | 6.2903 (8.3814) | 0.9610 (0.0551) |
| FCNN+SF | 0.9735 (0.0506) | 9.3289 (13.4287) | 0.8872 (1.2804) |
| T-FCNN+SF | 0.9717 (0.0743) | 8.1635 (12.8057) | 0.8213 (1.7954) |
| FCNN+CRFs+SF | 0.9762 (0.0462) | 8.4325 (11.1094) | 0.8939 (1.2843) |
| T-FCNN+CRFs+SF | 0.9737 (0.0735) | 7.5983 (11.5803) | 0.8097 (1.9614) |

TABLE I: Segmentation performance results obtained on the TWINS-UK dataset using different combinations of segmentation (FCNN and T-FCNN) and post-processing (CRF and SF) algorithms.
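The relative APD reductions quoted in the text can be checked directly from the APD column of Table I. The short sketch below does this arithmetic; the helper name `relative_reduction` is hypothetical.

```python
# Mean APD values (mm) taken from Table I.
apd_mm = {"FCNN": 10.2734, "T-FCNN": 7.1427, "T-FCNN+CRFs": 6.2903}

def relative_reduction(before, after):
    """Percentage reduction of an error metric going from `before` to `after`."""
    return 100.0 * (before - after) / before

temporal_gain = relative_reduction(apd_mm["FCNN"], apd_mm["T-FCNN"])
crf_gain = relative_reduction(apd_mm["T-FCNN"], apd_mm["T-FCNN+CRFs"])
print(f"FCNN -> T-FCNN: {temporal_gain:.1f}%")         # about 30.5%
print(f"T-FCNN -> T-FCNN+CRFs: {crf_gain:.1f}%")       # about 11.9%
```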

V Discussion and Conclusions

Exploitation of the temporal coherence across frames is a useful resource for the fully automated, robust and accurate segmentation of CINE MRI sequences, as required for clinical practice.

A CINE MRI study has a high level of coherence both in space and time. Driven by clinical interest (i.e. the generation of metrics such as the blood pool volume at a given time) and by the lower burden of generating the ground truth (i.e. only 10-12 slices need to be segmented at a given time), most of the algorithms proposed to date exploit the coherence in space [13].

The use of a recurrent unit in space (i.e. across slices) reduced the APD by 4.2% and 13% in the MICCAI and PRETERM cohorts, as reported in [13], whereas the use of a recurrent unit in time (i.e. across frames) improved the APD by 30% in our TWINS-UK cohort (see Table I). Temporal coherence thus seems to be a more useful resource, probably because there is a larger correlation between adjacent frames in time than between adjacent slices in a CINE MRI study. This has been achieved by a simple extension of the U-Net architecture through a recurrent neural network component (T-FCNN).

In the search for the best strategy to exploit temporal coherence, our results suggest that T-FCNN is a better solution than FCNN+SF. We cannot claim that we fully exploited the potential of both strategies (e.g. exploring the whole parametric space of SF is a tedious exercise), but our results with a reasonable set of parameter and architecture combinations consistently suggested that SF improved the LV segmentation performance only marginally, if at all.

It is important to note that the clinical protocol for the CMR segmentation in our TWINS-UK cohort included the papillary muscles as part of the myocardium, in contrast with the majority of previous studies [13, 6]. This is a more challenging task for both the human observer and the algorithm, since the smoothness constraints of contouring a circular shape cannot be used, and is the reason why previous APD values were smaller (i.e. around 2 mm [13]).

Nevertheless, the use of a recurrent unit is not a perfect solution, since it is limited by the problem of vanishing gradients, which reduces the performance of backpropagation through time. This limits the number of cardiac frames that can be input to the architecture. In practice, this should not be a major problem, since acquisitions with more than 30 frames, as used in this study, are not common in CMR.

The translation of this concept to echocardiography will nevertheless bring some challenges, since sequences can reach up to thousands of frames when ultra-fast acquisition protocols are used. An easy first solution would then be to impose an upper bound on the gradient (i.e. gradient clipping) or to add a regularisation term able to increase or decrease the gradient magnitude [15]. Finally, the use of 3D convolution kernels could be useful to progressively reduce the length of the input sequences, avoiding excessively long temporal or spatial sequences.
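Gradient clipping, mentioned above as a first remedy, can be sketched as a rescaling of the gradient whenever its norm exceeds an upper bound. The following is a minimal illustration on a flat list of gradient values; the function name `clip_by_global_norm` and the threshold are assumptions, not part of the original work.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so that their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm

# Toy gradient with norm 5, clipped to norm 1; its direction is preserved.
clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
```

Clipping by the norm keeps the direction of the update unchanged and only bounds its magnitude, which is why it is a low-risk first measure for long sequences.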

The main practical bottleneck is the heavier burden of segmentation needed to transfer the learning to another study or experimental setting. Instead of learning from single frames, the proposed architecture needs to learn from complete sequences. The availability of semi-automatic segmentation solutions in commercial products should make this task affordable. In any case, our TWINS-UK dataset with its manual ground truth segmentation is made available on request to the community as a reference for future solutions that exploit the temporal coherence in CMR sequences.