The World Health Organization identifies ischaemic heart diseases as the leading cause of death [who_top_ten_cod]. Imaging technologies such as MRI yield clinically well-established parameters, including ESV, EDV, EF, and SV, as well as VM. In order to determine these clinical biventricular cardiac mass and function parameters, a skilled physician with expertise in cardiac MRI usually has to segment the image data, a task that typically takes approximately 15-20 minutes per case.
Automated image segmentation, especially of 2D short-axis cardiac cine MRI stacks, is a highly competitive research field. Active contour models [kass_snakes_1988; osher_fronts_1988] and machine learning approaches [poudel_recurrent_2016; avendi_combined_2016; tran_fully_2016] are among the most successful methods.
One major challenge for the design of robust classifiers for automated cardiac image segmentation is the lack of manually annotated training data (ground truth). Hence, models with a high number of free parameters, such as deep neural networks, tend to overfit to the characteristics of the assembled data. Image segmentation quality for similar image morphologies is typically sufficient; however, the quality may rapidly degrade for differing image characteristics. This can be caused by varying image acquisition techniques, varying experimental protocols, and image-morphology-altering illnesses such as cardiomyopathies. Additionally, image segmentation philosophies have been shown to have a major influence on the resulting biventricular mass and function parameters. The inclusion or exclusion of trabeculations and papillary muscles affects the left and right ventricular mass as well as the endocardial volume [sievers_impact_2004; winter_evaluating_2008; vogel-claussen_left_2006; kawel-boehm_normal_2015]. No convention has been universally accepted for analyzing trabeculation and papillary muscle mass [schulz-menger_standardized_2013]. Moreover, the space of possible medical images is orders of magnitude larger than the number of assembled samples. Many studies limit themselves to a single data source at a time for training and validation [avendi_combined_2016; queiros_fast_2014; hu_hybrid_2013; liu_automatic_2012; huang_image-based_2011]. This might impair the generalization potential of the corresponding models.
A recent paper by Tran et al. [tran_fully_2016] applied transfer learning to adapt a pre-trained model to novel data sets. Unfortunately, a major drawback of this approach is the time-consuming retraining of the neural network.
Our study investigates the generalization potential of cardiac image segmentation. In detail, we have composed a diverse data set consisting of images with highly varying characteristics and have further applied non-linear augmentation techniques to artificially increase the number of training samples by orders of magnitude. We further demonstrate how to determine fully automated, high-quality estimates of clinical parameters, such as end-systolic volume (ESV), end-diastolic volume (EDV), ejection fraction (EF), stroke volume (SV), and ventricular mass (VM).
2 Materials and Methods
2.1 Evaluation Measures
In this study we evaluate the performance of the proposed cardiac segmentation approach by determining the clinical gold standard parameters EF, ESV, and EDV as well as the VM of the left and right ventricle. Furthermore, we compare the automatically computed image segmentation with the ground truth in terms of similarity measures, such as overlap and the Dice similarity coefficient (DSC).
2.1.1 Evaluation of Biventricular Cardiac Mass and Function
The performance of the computed segmentation is determined by calculating EF, ESV, and EDV. In order to determine the EF in a 2D short-axis cine MRI stack, the ESV and EDV are typically measured by segmenting the corresponding volumes as described by [winter_evaluating_2008] (Simpson's method). The EF is the fraction of the EDV which is ejected with every cardiac cycle, i.e. the normalized difference of the corresponding volumes at end-systole and end-diastole:
\[
\mathrm{EF} = \frac{V_{\mathrm{ED}} - V_{\mathrm{ES}}}{V_{\mathrm{ED}}},
\]
where \(V_{\mathrm{ES}}\) denotes the ESV and \(V_{\mathrm{ED}}\) refers to the EDV.
The VM is calculated as the product of a constant conversion density factor \(\rho\) and the myocardial volume of the right (\(V_{\mathrm{RV}}\)) or the left (\(V_{\mathrm{LV}}\)) ventricle, respectively:
\[
\mathrm{VM}_{\mathrm{LV/RV}} = \rho \cdot V_{\mathrm{LV/RV}},
\]
where \(\rho\) is a phenomenologically determined conversion factor of 1.05 g/cm\(^3\). Note that the VM is usually determined at end-diastole.
The clinical parameters are gathered for the manual ground truth and the automatically computed prediction by counting the corresponding voxels. Volumes and other derived quantities are compared in terms of Spearman's rank correlation coefficient, the root-mean-square deviation (RMSD), the mean absolute percentage error (MAPE), the mean error (ME), as well as the intraclass correlation coefficient (ICC).
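The voxel-counting step described above can be sketched in a few lines. The following helper is illustrative (function and parameter names are ours, not from the original implementation); it derives ESV, EDV, SV, and EF from binary end-systolic and end-diastolic masks and the physical voxel volume:

```python
import numpy as np

def clinical_parameters(es_mask, ed_mask, voxel_volume_ml):
    """Derive clinical parameters from binary segmentation masks
    (hypothetical helper). voxel_volume_ml is the volume of one voxel
    in millilitres, e.g. 1.4 mm x 1.4 mm x 8 mm = 0.01568 ml.
    The VM would follow analogously as 1.05 g/cm^3 times the
    myocardial volume obtained from a myocardium mask."""
    esv = es_mask.sum() * voxel_volume_ml   # end-systolic volume
    edv = ed_mask.sum() * voxel_volume_ml   # end-diastolic volume
    sv = edv - esv                          # stroke volume
    ef = sv / edv                           # ejection fraction
    return {"ESV": esv, "EDV": edv, "SV": sv, "EF": ef}
```

In practice the per-slice areas are summed over the stack (Simpson's method); with isotropic slice spacing this reduces to the voxel count shown here.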
2.1.2 Evaluation of Technical Parameters
In this study we report standard performance metrics in order to evaluate the quality of the automated cardiac segmentation. These metrics include geometrical measures quantifying overlap, such as the Dice similarity coefficient, and traditional machine learning metrics, such as accuracy, precision, recall, and specificity.
The Dice similarity coefficient is twice the intersection of two volumes divided by the sum of said volumes:
\[
\mathrm{DSC} = \frac{2\,|A \cap B|}{|A| + |B|}.
\]
All performance measures are carried out on calculated 3D spatial volumes.
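The overlap and confusion-matrix measures above can be computed directly from a pair of binary 3D volumes. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Overlap metrics between two binary volumes (a sketch)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()    # true positives
    fp = np.logical_and(pred, ~truth).sum()   # false positives
    fn = np.logical_and(~pred, truth).sum()   # false negatives
    tn = np.logical_and(~pred, ~truth).sum()  # true negatives
    return {
        "dice": 2 * tp / (2 * tp + fp + fn),  # 2|A∩B| / (|A|+|B|)
        "overlap": tp / (tp + fp + fn),       # Jaccard index
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```

Note that the overlap (Jaccard) index relates to the Dice coefficient as DSC/(2 − DSC), which is consistent with the paired columns in the appendix table.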
2.2 Data Sets
The experiments were performed on four independent 2D short axis cine cardiac MRI data sets as depicted in Figure 1:
2.2.1 MHH Data Set
The data set consists of 193 training and 309 validation 2D SSFP short-axis cine MRI stacks. The end-systolic and end-diastolic contours have been created by a senior radiologist (15 years of experience in cardiac MRI) and are accepted as ground truth for this study. Image acquisition was performed on three different 1.5 T MRI scanners of a single vendor (Siemens Healthineers, Erlangen, Germany) using an SSFP sequence with a slice thickness of 8 mm, no gap, and an acquisition matrix of 256 × 208 pixels with an in-plane spatial resolution of 1.4 × 1.4 mm².
2.2.2 Kaggle Data Set
The data set is composed of 500 training, 440 testing, and 200 validation 2D short-axis cine stacks. It has been compiled by the NIH and the CNMC. It is at least one order of magnitude more extensive than any data set previously released. Each stack contains approximately 30 images over the cardiac cycle. The data set does not include a ground truth segmentation; instead, the end-systolic and end-diastolic volumes of the LV are provided. This data set was the subject of the Second Annual Data Science Bowl contest.
Our study design does not require the fine-grained separation of the data into test, training, and validation sets as provided. Therefore, the validation and test sets have been merged in order to extend the validation set to 640 cases. The 2D short-axis cine MRI stacks have been converted into 4D matrices stored in the NIfTI format. This process failed for 14 of the training and 38 of the validation cases due to inconsistent image dimensions, resulting in 486 training and 602 validation NIfTI files.
The training set has been used twice. First, 60 cases with high visual diversity have been manually segmented and subsequently included in the training process of the neural network. Second, all 486 training cases have been utilized to fit a linear regression for the adjustment of the clinical parameters.
2.2.3 LVSC Data Set
The MICCAI 2009 LV Segmentation Challenge [MICCAI_2009_challange_database] data set has been published by the Sunnybrook Health Sciences Center (Canada). This data set has been utilized in two separate experiments:
In the first experiment, this data set was used exclusively for validation purposes, i.e. none of its images was used for training the neural network. This decision was made in order to explore the generalization potential of the network; the corresponding results are referred to as the ad-hoc performance. The data set consists of 45 cases: 12 patients with heart failure with infarction, 12 with heart failure without infarction, 12 with LV hypertrophy, and 9 healthy subjects. Ground truth segmentation is available for the end-diastole and the end-systole. In the original on-line contest, the data was split into three parts: 15 cases for training, 15 for testing, and 15 for validation. Data conversion failed for one case (SC-HYP-37). As mentioned before, no training images of this data set have been used for this study. Therefore, all available images have been assembled into a single validation set consisting of 44 cases with end-systolic and end-diastolic segmentation, yielding a total of 88 2D image stacks.
In the second experiment, the data set has been split into 29 training and 15 validation images. A network, pretrained on the regime depicted in Figure 1, has been retrained on this specific split in order to evaluate its performance in contrast to the ad-hoc results; these are referred to as results with retraining.
Image acquisition was performed on a 1.5 T GE Signa MRI scanner from the atrioventricular ring to the apex in 6 to 12 2D short-axis cine stacks with a slice thickness of 8 mm, a gap of 8 mm, and a field of view of 320 × 320 mm with a matrix resolution of 256 × 256.
This data set is relevant since it has been extensively used in prior studies for training and evaluation. The end-systolic and end-diastolic contours are provided for the endo- and epicardial volume.
2.2.4 RVSC Data Set
The RVSC was held at the MICCAI 2012 conference. For this challenge, a data set consisting of 16 training and 32 validation 2D short-axis cine MRI stacks was acquired. Contour data is provided only for the training set; the validation contours are withheld by the organizers in order to ensure an independent validation.
For this study we utilize the training set for validation, as contour information is only available for the training set. However, it has to be stressed that no image or segmentation data of this set has been used in the training process of the neural network. All 16 training cases have been segmented at the end-systolic and end-diastolic phases, yielding a total of 32 2D image stacks. Throughout the rest of this paper we refer to this original training set as the validation set of this study.
Image acquisition has been performed on a 1.5 T scanner (Symphony Tim, Siemens Medical Systems, Erlangen, Germany). Retrospectively gated balanced SSFP cine MRI sequences were acquired for analysis with repeated breath-holds of 10-15 s. A total of 8-12 2D short-axis cine planes were acquired from the base to the apex of the ventricles. The temporal resolution of the cine images is 20 images per cardiac cycle. All images have been zoomed and cropped to a 256 × 216 or 216 × 256 resolution, leaving the LV visible.
2.3 Network Topology
The topology of ν-net (/nju:nt/), the neural network of this study, is depicted in Figure 2. It is derived from the U-Net architecture [ronneberger_u-net_2015] and has been implemented in TensorFlow [tensorflow_2015]. The input layer has been resized to 256 × 256 neurons, with consecutive downsampling of the subsequent layers using a 3 × 3 convolution with a 2 × 2 stride in contrast to the max-pooling operation of the original U-Net architecture. The padding has been changed from valid to same, resulting in an output layer of the same size as the input layer. All activation functions have been changed from a ReLU to a PReLU as proposed by He et al. [he_delving_2015].
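The two design changes relative to the original U-Net (strided convolution instead of max-pooling, and PReLU instead of ReLU) can be illustrated with a minimal NumPy sketch; the helper names are ours, and a real implementation would use TensorFlow's built-in convolution layers rather than these loops:

```python
import numpy as np

def strided_conv2d(x, kernel, stride=2):
    """3x3 convolution with stride 2 and 'same' padding (NumPy
    sketch of the downsampling used instead of max-pooling).
    Halves each spatial dimension, e.g. 256x256 -> 128x128."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, pad, mode="constant")
    h_out = (x.shape[0] + stride - 1) // stride
    w_out = (x.shape[1] + stride - 1) // stride
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            patch = xp[i * stride:i * stride + k,
                       j * stride:j * stride + k]
            out[i, j] = (patch * kernel).sum()
    return out

def prelu(x, alpha=0.25):
    """Parametric ReLU: identity for x > 0, learned slope alpha
    (here a fixed illustrative value) for x <= 0."""
    return np.where(x > 0, x, alpha * x)
```

Unlike max-pooling, the strided convolution learns its own downsampling weights, which is the rationale for the substitution.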
2.4 Network Training
The neural network was trained by minimizing binary cross-entropy as the objective function. Backpropagation was used to compute the gradients of the cross-entropy loss. The model was initialized with random values sampled from a uniform distribution without scaling variance (uniform scaling) as proposed by Glorot and Bengio [glorot_understanding_2010]. Adaptive Moment Estimation (Adam) [kingma_adam_2014] was chosen as the stochastic optimization method. The learning rate was gradually reduced from its initial value during training.
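The binary cross-entropy objective, applied pixel-wise to the sigmoid output and the binary ground truth mask, can be written as a short sketch (in practice one would use TensorFlow's built-in loss rather than this illustrative function):

```python
import numpy as np

def binary_cross_entropy(pred, truth, eps=1e-7):
    """Mean pixel-wise binary cross-entropy (a sketch).
    pred: predicted foreground probabilities in (0, 1);
    truth: binary ground truth mask. Clipping avoids log(0)."""
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(truth * np.log(p) + (1 - truth) * np.log(1 - p))
```

A maximally uncertain prediction of 0.5 everywhere yields a loss of ln 2 ≈ 0.693, the usual sanity check for a freshly initialized binary classifier.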
The training set is assembled from 3,519 2D images (2,894 Hannover Medical School and 625 Data Science Bowl Cardiac Challenge images). Data augmentation has been applied to artificially inflate the training set, as described in Section 2.5. One complete training run of ν-net takes about 24 to 36 hours.
2.5 Data Augmentation
Data augmentation is often used to artificially inflate the training data set [simard_best_2003; ciresan_high-performance_2011; ciregan_multi-column_2012]. This technique generates similar images and their corresponding segmentations from already existing data by applying local spatial transformations to them. The key idea behind this approach is that a slightly deformed heart should be identified by the neural network in a similar manner. Due to the image augmentation in the training phase, the algorithm provides good segmentation results regardless of the orientation, scale, and parity of the input image. Translational equivariance is inherited from the convolutional network architecture, i.e. a transformed heart is mapped pixel-wise onto the corresponding transformed segmentation in the spatial domain.
In order to fully utilize the computational resources of modern workstations equipped with a multicore CPU and a CUDA-enabled GPU, we have developed an auxiliary library that facilitates parallel image augmentation on the CPU while training of the neural network is delegated to the GPU. The time needed for image augmentation can be completely hidden, since concurrent augmentation of a batch of input images can be accomplished faster than the actual training step. Hence, augmentation and training are performed efficiently in an interleaved manner: the next batch of images is augmented on the CPU while the current batch is still being trained on the GPU.
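The interleaving pattern described above is an ordinary producer-consumer prefetch. The actual library is implemented in C++ with OpenMP; the following Python sketch (with hypothetical `augment` and `train_step` callables) only illustrates the scheduling idea:

```python
from concurrent.futures import ThreadPoolExecutor

def interleaved_training(batches, augment, train_step):
    """Hide augmentation latency behind training (a sketch):
    while batch k is being trained, batch k+1 is already being
    augmented in a background worker."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        it = iter(batches)
        future = pool.submit(augment, next(it))   # prefetch first batch
        for nxt in it:
            current = future.result()             # augmented batch ready
            future = pool.submit(augment, nxt)    # prefetch next batch
            train_step(current)                   # "GPU" work
        train_step(future.result())               # drain last batch
```

As long as one augmentation finishes faster than one training step, the augmentation cost vanishes from the wall-clock time, as stated above.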
The library is implemented in the C++ programming language using the OpenMP extension. It features convenient bindings for the Python programming language which seamlessly interact with the TensorFlow framework. Moreover, due to the aforementioned efficient interleaving, we can apply highly non-linear and computationally expensive local deformations to the input data as well as to the ground truth segmentations. Besides traditional global transformations from the affine group, such as scaling, shearing, rotation, mirroring, and translation, we allow for the pixel-wise deformation of the spatial pixel domain
\[
(i, j) \mapsto \bigl(f^{(+)}(i, j),\; f^{(-)}(i, j)\bigr),
\]
where \(f^{(\pm)}\) are second-degree multivariate polynomials in the pixel coordinates,
\[
f^{(\pm)}(i,j) := a_i^{(\pm)} \cdot i + a_j^{(\pm)} \cdot j + a_{ij}^{(\pm)} \cdot ij + a_{ii}^{(\pm)} \cdot i^2 + a_{jj}^{(\pm)} \cdot j^2,
\]
and the coefficients are sampled from a uniform distribution over a closed interval around the identity mapping. A hyper-parameter controlling the width of this interval determines the amount of deformation; the special case of vanishing width refers to no deformation. Fractional indices are mapped via bilinear interpolation. Figure 3 shows an exemplary augmentation of an MRI.
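The polynomial warp with bilinear interpolation can be sketched as follows. This is an illustrative NumPy reimplementation, not the original C++/OpenMP library; coefficient dictionaries follow the notation in the text, and choosing the identity coefficients reproduces the no-deformation special case:

```python
import numpy as np

def polynomial_deformation(image, coeffs_i, coeffs_j):
    """Pixel-wise deformation by second-degree polynomials in the
    pixel coordinates, with bilinear interpolation at fractional
    source positions (a sketch)."""
    h, w = image.shape
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))  # j varies along columns

    def poly(c, i, j):
        # f(i, j) = a_i*i + a_j*j + a_ij*i*j + a_ii*i^2 + a_jj*j^2
        return (c["i"] * i + c["j"] * j + c["ij"] * i * j
                + c["ii"] * i ** 2 + c["jj"] * j ** 2)

    fi = np.clip(poly(coeffs_i, ii, jj), 0, h - 1)
    fj = np.clip(poly(coeffs_j, ii, jj), 0, w - 1)
    # bilinear interpolation between the four neighbouring pixels
    i0, j0 = np.floor(fi).astype(int), np.floor(fj).astype(int)
    i1, j1 = np.minimum(i0 + 1, h - 1), np.minimum(j0 + 1, w - 1)
    di, dj = fi - i0, fj - j0
    return ((1 - di) * (1 - dj) * image[i0, j0]
            + (1 - di) * dj * image[i0, j1]
            + di * (1 - dj) * image[i1, j0]
            + di * dj * image[i1, j1])
```

Sampling the coefficients near the identity (linear terms near 1, all others near 0) yields the small, smooth deformations used for augmentation; the same warp is applied to the image and its ground truth mask.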
This study explores the possibility of creating a general-purpose cardiac image segmentation model capable of reliably producing high-quality segmentations, independent of aspects such as different image acquisition techniques and diverse MRI protocols. For this purpose, the model was trained on a proprietary data set (MHH) as well as on a small subsample of the Kaggle training set. The goal was to learn the specific concept of cardiac segmentation from the highly standardized MHH data set while abstracting a more general notion of different image morphologies from the heterogeneous Kaggle data set, in turn preventing overfitting to the characteristics of a specific data set, as depicted in Figure 4. The resulting intraclass correlation coefficients for all evaluated data sets are listed in Table 2.
The agreement of the predicted segmentations with the ground truth is high for the MHH data set, with a Spearman's rho of 0.98 across all segmented volumes of the left and right endocardium as well as the VM, as depicted in Figure 5. This is supported by a Dice coefficient of about 90 %, as depicted in Table 1, as well as a high ICC of 92-99 % for the ESV and EDV of both ventricles. A lower Dice coefficient of 78 % is obtained for the right VM. Further results, such as Dice, overlap, and accuracy, are listed in Table 3. A selection of images of the MHH, LVSC, and RVSC data sets is illustrated in Figure 6.
| Method | V | DSC (epi) | DSC (endo) |
|---|---|---|---|
| Hannover Medical School | | | |
| proposed | LV | 95±2 % | 92±4 % |
| proposed | RV | 90±4 % | 88±6 % |
| MICCAI 2009 LV Segmentation Challenge | | | |
| proposed (ad-hoc) | LV | 93±3 % | 84±7 % |
| proposed (with retraining) | LV | 95±3 % | 94±3 % |
| ngo_combining_2017 | LV | 93±2 % | 88±3 % |
| tran_fully_2016 | LV | 96±1 % | 92±3 % |
| queiros_fast_2014 | LV | 94±2 % | 90±5 % |
| hu_hybrid_2013 | LV | 94±2 % | 89±3 % |
| liu_automatic_2012 | LV | 94±2 % | 88±3 % |
| huang_image-based_2011 | LV | 93±2 % | 89±4 % |
| Right Ventricular Segmentation Challenge | | | |
| proposed (ad-hoc) | RV | 86±6 % | 85±7 % |
| tran_fully_2016 | RV | 86±11 % | 84±21 % |
| zuluaga_multi-atlas_2013 | RV | 80±22 % | 76±25 % |
| wang2012simple | RV | 63±35 % | 59±34 % |
| ou_multi-atlas_2012 | RV | 63±27 % | 58±31 % |
The predicted segmentation can be used to directly compute the clinical parameters on MHH data by applying Simpson's method [winter_evaluating_2008]. The same approach cannot be applied to MRI stemming from different sources because of differing segmentation philosophies, which result in differing clinical measurements for the same case. To compensate for this variation, a linear regression has been performed to adapt the predicted ESV and EDV of the Kaggle data set. This fit has been determined using standard linear regression, mapping the predicted volumes of the training set onto the ground truth scalar volumes while omitting the vertical intercept. Neither the model in training nor the linear regressor was fitted to any of the validation cases, in order to eliminate the possibility of leakage. As depicted in Figure 7, the agreement with the ground truth is high, with a Spearman's rho of 0.96, an ICC of 92/94 % (ESV/EDV), and a MAPE of 14 %.
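The intercept-free linear adjustment described above is a one-parameter least-squares fit; a minimal sketch (function name ours):

```python
import numpy as np

def fit_volume_adjustment(predicted, ground_truth):
    """Least-squares slope through the origin mapping predicted
    volumes onto reference volumes (linear regression without
    a vertical intercept): slope = <p, g> / <p, p>."""
    p = np.asarray(predicted, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    return (p @ g) / (p @ p)

# usage sketch: fit on training volumes, then apply to validation
# predictions: adjusted = slope * predicted_validation_volumes
```

Omitting the intercept forces a zero predicted volume to map to a zero adjusted volume, which is physically sensible for a volume calibration.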
The performance of the proposed method would have ranked among the best 20 competitors of the Second Annual Data Science Bowl with a CRPS of 0.0154. Figure 8 depicts the weakest segmentation, based on the CRPS, for additional illustration.
The correlation of the left and right epi- and endocardial volumes of the LVSC and RVSC data sets is high, with a Spearman's rho of 0.95 as depicted in Figures 11 and 12, as well as ICC values of 96/97 % (ESV/EDV) for the left and 92/96 % for the right ventricle. A lower rho of 0.83 is observed for the right epicardium. Dice coefficients for a selection of relevant studies are presented in Table 1. RMSD, ME, and MAPE were calculated after adjusting the resulting volumes with a linear regression fitted on the validation set. This was necessary, as the validation sets are too small to perform a reasonable split. A random sample of images and the corresponding segmentations from the LVSC and RVSC data sets is illustrated in Figures 9 and 10.
There are many philosophies on how to perform image segmentation of cardiac MRI. These philosophies have a major influence on the resulting biventricular mass and function parameters. For example, the inclusion or exclusion of trabeculations and papillary muscles affects the left and right ventricular mass as well as the endocardial volume [sievers_impact_2004; winter_evaluating_2008; vogel-claussen_left_2006]. No convention has been universally accepted for analyzing trabeculation and papillary muscle mass [schulz-menger_standardized_2013].
Typically, these conventions are enforced on a per-institute basis. This poses a major challenge for automated image segmentation with deep neural networks because of the resulting differences in the clinical measurements. In order to learn the specific segmentation characteristics of a site, a specifically tailored model would have to be trained, which renders the whole process impractical.
However, most of the time, image segmentation is only a means to the end of determining the clinical parameters by measuring the ventricular volumes. These parameters include the VM, ESV, and EDV, which yield the EF and SV. These established clinical parameters are of great importance in the early detection of cardiac illnesses as well as in treatment monitoring.
The hypothesis of this study is that the clinical parameters can be reliably measured by adapting a pre-trained neural network to a new environment with one of the most basic statistical models, a linear regression. This result is unexpected, since differences in segmentation guidelines are usually of a local nature and do not necessarily exhibit a linear dependency on the final measurement. In order to substantiate this assumption, ν-net was trained on the extensive MHH data set. ν-net demonstrates state-of-the-art performance, as depicted in Figure 5. Furthermore, ν-net was trained on a small, hand-picked subset of the Kaggle training set. This was done with the idea of transfer learning in mind, in order to convey an understanding of different image acquisition methods and varying image morphologies to the neural network.
ν-net was benchmarked against the Kaggle validation set. As depicted in Figure 7, it was possible to adapt the results of the classifier to the specific data set by employing a linear regression fitted on the training set.
The performance of the proposed method would have ranked among the best 20 competitors of the Second Annual Data Science Bowl. A strong correlation between predicted and ground truth volumes was observed, with a Spearman's rho of 0.965 and a MAPE of 14.4 %. Nevertheless, the final predicted volumes of the proposed solution could be fine-tuned using high-level machine learning regressors, as performed by the winning solution. This study, however, explicitly omits sophisticated post-processing and the use of metadata (such as gender, age, height, and scanner geometry) in order to avoid overfitting to a specific experimental setting, as this could impair the neural network's generalization potential.
The hypothesis is further substantiated by the results of the LVSC and RVSC data sets. As depicted in Figures 11 and 12, the segmentation results demonstrate a strong correlation with the manual segmentation, exhibiting a Spearman's rho of approximately 0.95. Furthermore, the corresponding ICC values imply human-level performance in determining clinical parameters. ν-net surpasses human-level performance in predicting parameters for the right ventricle. Note that volume prediction was performed without training on a single image of these data sets.
ν-net demonstrates state-of-the-art ad-hoc segmentation performance in terms of the Dice coefficient for the epi- and endocardium of the RVSC data set and the epicardium of the LVSC data set compared to [tran_fully_2016] and [avendi_fully_2016] as well as other studies [avendi_combined_2016; ngo_combining_2017; poudel_recurrent_2016; queiros_fast_2014; hu_hybrid_2013; liu_automatic_2012; huang_image-based_2011; zuluaga_multi-atlas_2013; wang2012simple; ou_multi-atlas_2012]. This is remarkable, as ν-net was not trained on any images of the aforementioned data sets. The slightly weaker ad-hoc Dice of 84 % for the endocardium of the LVSC could reflect different segmentation philosophies compared to the original MHH data set, resulting in a systematic error as shown in Figure 11. This hypothesis is supported by correspondingly high ad-hoc ICC values for the ESV, EDV, EF, and SV. Furthermore, additional retraining results in state-of-the-art performance for the endocardial Dice coefficient.
Regarding the MHH data set, ν-net achieves comparable or higher agreement with the ground truth than two human observers achieve with each other on average when measuring biventricular mass and function parameters [caudron_cardiac_2012]. Furthermore, ν-net accomplishes performance comparable to humans on the LVSC and RVSC data sets. ν-net outperforms a human observer by a wide margin, especially at the task of gauging the right ventricular endocardial volume and ventricular mass. A slightly lower ICC score of the left endocardial volumes on the Kaggle data set is most likely due to the multi-center and multi-observer setting, resulting in an inherent heterogeneity in the data set. In order to improve performance, the results would have to be evaluated for each observer independently.
One limitation of this study is the small size of openly available data sets. The LVSC and RVSC contain 61 cases with freely accessible contours. Furthermore, the aforementioned data sets include the segmentation of the LV or RV exclusively. Additionally, training and validating on a single center data set bears the risk of overfitting as demonstrated in Figure 4. Therefore, a large, fully labeled, multi-center, multi-reader data set would be advantageous. This data set could be used to train and evaluate future models.
This study demonstrates the reliability of automatically determining clinical cardiac parameters such as ESV, EDV, EF, SV, and VM on four data sets. Especially for the RV, the neural network outperformed the human cardiac expert in the presented study, which likely enables more reliable RV mass and function measurements for improved clinical treatment monitoring in the future. Furthermore, it is demonstrated that the aforementioned parameters, resulting from an image segmentation by a pre-trained neural network, can be adjusted by performing a linear regression. This effectively eliminates the associated costs of introducing a neural network for determining clinical cardiac parameters in a new setting.
COMPETENCY IN MEDICAL KNOWLEDGE: Deep learning can reliably provide high-quality, fully automated cardiac segmentation for the precise determination of clinically well-established biventricular mass and function parameters in a multi-center setting.
TRANSLATIONAL OUTLOOK: The presented neural network is ready to be used on large-scale, multi-center, multi-reader data sets for cost- and time-efficient analysis of cardiac mass and function parameters.
The authors thank all parties involved in the data acquisition process, especially Frank Schröder and Lars Kähler.
Appendix A Figures and Tables
| Structure | Accuracy | Specificity | Precision | Recall | DSC | Overlap |
|---|---|---|---|---|---|---|
| LV endocardial | 99.9±0.0 % | 99.9±0.0 % | 91.1±6.6 % | 92.3±5.1 % | 91.5±4.3 % | 84.6±6.9 % |
| LV mass | 99.8±0.1 % | 99.9±0.0 % | 89.6±4.3 % | 88.3±4.0 % | 88.8±2.9 % | 80.1±4.6 % |
| LV epicardial | 99.8±0.1 % | 99.9±0.1 % | 94.8±3.4 % | 94.3±2.7 % | 94.5±2.0 % | 89.6±3.4 % |
| RV endocardial | 99.9±0.1 % | 99.9±0.0 % | 89.4±6.2 % | 86.7±7.9 % | 87.7±5.5 % | 78.5±8.1 % |
| RV mass | 99.8±0.1 % | 99.9±0.0 % | 80.4±8.2 % | 72.4±7.1 % | 75.8±5.9 % | 61.4±7.5 % |
| RV epicardial | 99.8±0.1 % | 99.9±0.0 % | 92.9±4.1 % | 87.2±5.9 % | 89.8±3.8 % | 81.8±6.0 % |