ν-net: Deep Learning for Generalized Biventricular Cardiac Mass and Function Parameters

06/14/2017, by Hinrich B. Winther et al.

Background: Cardiac MRI derived biventricular mass and function parameters, such as end-systolic volume (ESV), end-diastolic volume (EDV), ejection fraction (EF), stroke volume (SV), and ventricular mass (VM), are clinically well established. Image segmentation can be challenging and time-consuming due to the complex anatomy of the human heart. Objectives: This study introduces ν-net (/nju:nεt/), a deep learning approach allowing for fully-automated, high-quality segmentation of the right (RV) and left ventricular (LV) endocardium and epicardium for the extraction of cardiac function parameters. Methods: A set of 253 manually segmented cases was used to train a deep neural network. Subsequently, the network was evaluated on 4 different multicenter data sets with a total of over 1000 cases. Results: For LV EF the intraclass correlation coefficient (ICC) is 98, 95, and 80 (human expert ICCs reported in parentheses); the LV VM ICC is 95 and 94. Novel data sets can be processed with a simple adjustment procedure, allowing for the adaptation to distinct segmentation philosophies. ν-net exhibits state-of-the-art performance in terms of the dice coefficient. Conclusions: Biventricular mass and function parameters can be determined reliably and in high quality by applying a deep neural network for cardiac MRI segmentation, especially in the anatomically complex right ventricle. Adaptation to individual segmentation styles by applying a simple adjustment procedure is viable, allowing for the processing of novel data without time-consuming additional training.


1 Introduction

The World Health Organization identifies ischaemic heart diseases as the leading cause of death who_top_ten_cod. Imaging technologies, such as MRI, yield clinically well established parameters, including ESV, EDV, EF and SV as well as VM. In order to determine these clinical biventricular cardiac mass and function parameters, a skilled physician with expertise in cardiac MRI usually has to segment the image data, a task that typically takes approximately 15-20 minutes per case.

Automated image segmentation, especially of 2D short axis cardiac cine MRI stacks, is a highly competitive research field. Active contour models kass_snakes_1988; osher_fronts_1988 and machine learning approaches poudel_recurrent_2016; avendi_combined_2016; tran_fully_2016 are among the most successful methods.

One major challenge for the design of robust classifiers for automated cardiac image segmentation is the lack of manually annotated training data (ground truth). Hence, models with a high number of free parameters, such as deep neural networks, tend to overfit to the characteristics of the assembled data. Image segmentation quality for similar image morphologies is typically sufficient; however, the quality might rapidly degrade for differing image characteristics. This can be caused by varying image acquisition techniques, varying experimental protocols and image morphology altering illnesses, such as cardiomyopathies. Additionally, image segmentation philosophies have been shown to have major influence on the resulting biventricular mass and function parameters. The inclusion or exclusion of trabeculations and papillary muscles affects the left and right ventricular mass as well as the endocardial volume sievers_impact_2004; winter_evaluating_2008; vogel-claussen_left_2006; kawel-boehm_normal_2015. No convention has been universally accepted for analyzing trabeculation and papillary muscle mass schulz-menger_standardized_2013. Moreover, the amount of realizable medical images is orders of magnitude larger than the number of assembled samples. Many studies limit themselves to a single data source at a time for training and validation avendi_combined_2016; queiros_fast_2014; hu_hybrid_2013; liu_automatic_2012; huang_image-based_2011. This might impair the generalization potential of the corresponding models.

A recent paper by Tran et al. tran_fully_2016 applied transfer learning to adapt a pre-trained model to novel data sets. A major drawback of this approach, however, is the time-consuming retraining of the neural network.

Our study investigates the generalization potential of cardiac image segmentation. In detail, we have composed a diverse data set consisting of images with highly varying characteristics and further applied non-linear augmentation techniques to artificially increase the number of training samples by orders of magnitude. We further demonstrate how to determine fully-automated, high-quality estimates of clinical parameters, such as end-systolic volume (ESV), end-diastolic volume (EDV), ejection fraction (EF), stroke volume (SV), and ventricular mass (VM).

2 Materials and Methods

2.1 Evaluation Measures

In this study we evaluate the performance of the proposed cardiac segmentation approach by determining the clinical gold standard parameters EF, ESV, EDV as well as VM of the left and right ventricle. Furthermore, we compare the automatically computed image segmentation with the ground truth in terms of similarity measures, such as overlap and dice.

2.1.1 Evaluation of Biventricular Cardiac Mass and Function

The performance of the computed segmentation is determined by calculating EF, ESV and EDV. In order to determine the EF in a 2D short axis cine MRI stack, the ESV and EDV are typically measured by segmenting the corresponding volumes as described by winter_evaluating_2008 (Simpson’s method). The EF is the fraction of the EDV, which is ejected with every cardiac cycle, i.e. the normalized difference of the corresponding volumes at end-systole and end-diastole:

EF = (EDV − ESV) / EDV (1)

where ESV denotes the end-systolic and EDV the end-diastolic volume.

The VM is calculated as the product of a constant density conversion factor ρ and the volume of the right (V_RV) or the left (V_LV) ventricle, respectively:

VM = ρ · V_RV/LV (2)

where ρ is a phenomenologically determined conversion factor of 1.05 g/cm³. Note that the VM is usually determined at end-diastole.

The clinical parameters are gathered for the manual ground truth and the automatically computed prediction by counting the corresponding voxels. Volumes and other derived quantities are compared in terms of Spearman’s rank correlation coefficient, the RMSD, MAPE, ME as well as ICC.
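The voxel-counting evaluation above can be sketched as follows. This is a minimal illustration, not the authors' code; the function names, mask shapes, and resolution values are hypothetical:

```python
import numpy as np

def volume_ml(mask, voxel_volume_mm3):
    """Volume of a binary segmentation mask in millilitres.

    Counts foreground voxels and multiplies by the voxel volume
    (1 ml = 1000 mm^3), mirroring Simpson's method on a stack of
    2D short-axis slices.
    """
    return mask.sum() * voxel_volume_mm3 / 1000.0

def ejection_fraction(edv_ml, esv_ml):
    """EF as the normalized difference of EDV and ESV (Eq. 1)."""
    return (edv_ml - esv_ml) / edv_ml

def ventricular_mass_g(myocardium_mask, voxel_volume_mm3, density=1.05):
    """VM as density (1.05 g/cm^3 = 1.05 g/ml) times myocardial volume (Eq. 2)."""
    return density * volume_ml(myocardium_mask, voxel_volume_mm3)

# toy example: 8 mm slices, 1.4 x 1.4 mm in-plane resolution
voxel_volume = 1.4 * 1.4 * 8.0            # mm^3 per voxel
ed_mask = np.ones((10, 64, 64), bool)     # hypothetical end-diastolic mask
es_mask = np.zeros_like(ed_mask)
es_mask[:, 16:48, 16:48] = True           # hypothetical end-systolic mask
edv = volume_ml(ed_mask, voxel_volume)
esv = volume_ml(es_mask, voxel_volume)
ef = ejection_fraction(edv, esv)
```

Because the voxel volume cancels out in Eq. 1, the EF of the toy masks depends only on the voxel counts.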

2.1.2 Evaluation of Technical Parameters

In this study we present the expected performance metrics in order to evaluate the quality of the automated cardiac segmentation. These metrics include geometrical measures quantifying overlap based on the dice and traditional Machine Learning metrics, such as accuracy, precision, recall, and specificity.

The dice is proportional to the intersection of two volumes A and B divided by the sum of said volumes:

DSC = 2 |A ∩ B| / (|A| + |B|) (3)

All performance measures are carried out on calculated 3D spatial volumes.
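As a concrete reading of Eq. 3, a minimal sketch of the dice (and the related overlap) on binary masks might look like this; it is an illustration, not the authors' implementation:

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks:
    twice the intersection divided by the sum of the volumes (Eq. 3)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    inter = np.logical_and(a, b).sum()
    total = a.sum() + b.sum()
    return 2.0 * inter / total if total else 1.0

def overlap(a, b):
    """Jaccard overlap (intersection over union); it relates to the dice
    via overlap = dice / (2 - dice)."""
    a, b = np.asarray(a, bool), np.asarray(b, bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0
```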

2.2 Data Sets

The experiments were performed on four independent 2D short axis cine cardiac MRI data sets as depicted in Figure 1:

Figure 1: Depiction of the split used for training and validating the neural network as well as the corresponding data sources from mhh, the kaggle, the LVSC, and the RVSC data sets. The outer blue sections denote the different data sources, the red section marks the portion of unused data, and the green section shows the resulting training/validation split.

2.2.1 mhh Data Set

The data set consists of 193 training and 309 validation 2D ssfp short axis cine MRI stacks. The end-systolic and end-diastolic contours have been created by a senior radiologist (15 years of experience in cardiac MRI) and are accepted as ground truth for this study. Image acquisition was performed on three different 1.5 T MRI scanners of a single vendor (Siemens Healthineers, Erlangen, Germany) using a ssfp sequence with a slice thickness of 8 mm, no gap, and an acquisition matrix of 256 × 208 pixels with an in-plane spatial resolution of 1.4 × 1.4 mm.

2.2.2 kaggle Data

The data set is composed of 500 training, 440 testing, and 200 validation 2D short axis cine stacks. It has been compiled by the NIH and the CNMC. It is at least one order of magnitude more extensive than any previously released data set. Each stack contains approximately 30 images over the cardiac cycle. The data set does not include a ground truth segmentation; instead, the end-systolic and end-diastolic volumes of the LV are provided. This data set has been the subject of the Second Annual Data Science Bowl contest.

Our study design does not require the fine-grained separation of the data into a test, training, and validation set as provided. Therefore, the validation and test set have been merged in order to extend the validation set to 640 cases. The 2D short axis cine MRI stacks have been converted into 4D matrices, stored in the nifti format. This process failed for 14 of the training and 38 of the validation cases due to inconsistent image dimensions, resulting in 486 training and 602 validation nifti files.

The training set has been used twice. First, 60 cases with high visual diversity have been manually segmented and subsequently included in the training process of the neural network. Second, all 486 training cases have been utilized to fit a linear regression for the adjustment of the clinical parameters.

2.2.3 LVSC Data Set

The MICCAI 2009 LV Segmentation Challenge MICCAI_2009_challange_database data set has been published by the Sunnybrook Health Sciences Center (Canada). This data set has been utilized in two separate experiments:

In the first experiment, this data set is used exclusively for validation purposes, i.e. none of the images is used for training the neural network. This decision has been made in order to explore the generalization potential of the network and is referred to as the ad-hoc performance. The data set consists of 45 cases: 12 patients with heart failure and infarction, 12 with heart failure without infarction, 12 with LV hypertrophy, and 9 healthy subjects. Ground truth segmentation is available for the end-diastole and the end-systole. The data was originally split into three parts for an on-line contest: 15 cases for training, 15 for testing, and 15 for validation. Data conversion failed for one case (SC-HYP-37). As mentioned before, no training images of this data set have been used for this study. Therefore, all available images have been assembled into a single validation set consisting of 44 cases with end-systolic and end-diastolic segmentation, yielding a total of 88 2D image stacks.

In the second experiment, the data set has been split into 29 training and 15 validation images. A network, pretrained on the regime as depicted in Figure 1, has been retrained on this specific split, in order to evaluate the performance in contrast to the ad hoc results, and referred to as results with retraining.

Image acquisition was performed on a 1.5 T GE Signa MRI scanner from the atrioventricular ring to the apex in 6 to 12 2D short axis cine stacks with a slice thickness of 8 mm, a gap of 8 mm, a field of view of 320 × 320 mm and a matrix resolution of 256 × 256.

This data set is relevant since it has been extensively used in prior studies for training and evaluation. The end-systolic and end-diastolic contours are provided for the endo- and epicardial volume.

Figure 2: Topology of ν-net, the artificial neural network of this study. The architecture is derived from U-Net. Each blue box corresponds to a multi-channel feature map. Size and number of channels are denoted above and to the left of each box. Different operations are symbolized by color-coded arrows.

2.2.4 RVSC Data Set

The RVSC was held at the MICCAI 2012 conference. For this challenge, a data set consisting of 16 training and 32 validation 2D short axis cine MRI stacks was acquired. Contour data is only provided for the training set; the validation contours are withheld by the authors in order to ensure an independent validation.

For this study we utilize the training set for validation, as contour information is only available for the training set. However, it has to be stressed that no image or segmentation data of the training set has been used in the training process of the neural network. All 16 training cases have been segmented at the end-systolic and end-diastolic phase, yielding a total of 32 2D image stacks. Throughout the rest of this paper we refer to the actual training set as the validation set of this study.

Image acquisition has been performed on a 1.5 T scanner (Symphony Tim, Siemens Medical Systems, Erlangen, Germany). Retrospectively gated balanced ssfp cine MRI sequences were performed for analysis with repeated breath-holds of 10-15 s. A total of 8-12 2D short axis cine planes were acquired from the base to the apex of the ventricles. The temporal resolution of the cine images is 20 images per cardiac cycle. All images have been zoomed and cropped to a 256 × 216 or 216 × 256 resolution, leaving the LV visible.

2.3 Network Topology

The topology of ν-net (/nju:nεt/), the neural network of this study, is depicted in Figure 2. It is derived from the U-Net architecture ronneberger_u-net_2015 and has been implemented in TensorFlow tensorflow_2015. The input layer has been resized to 256 × 256 neurons, with consecutive downsampling of the subsequent layers using a 3 × 3 convolution with a 2 × 2 stride, in contrast to the max-pooling operation of the original U-Net architecture. The padding has been changed from valid to same, resulting in an output layer of the same size as the input layer. All activation functions have been changed from ReLU to PReLU as proposed by He et al. he_delving_2015.
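The effect of replacing max-pooling with 'same'-padded strided convolutions on the feature-map sizes can be illustrated with a small sketch; the encoder depth chosen here is a hypothetical example, not taken from the paper:

```python
import math

def same_conv_out(size, stride):
    """Output size of a convolution with 'same' padding: ceil(size / stride).

    With stride 1 the spatial size is preserved, which is what makes the
    output layer match the 256 x 256 input; with stride 2 it halves,
    replacing U-Net's max-pooling as the downsampling operation."""
    return math.ceil(size / stride)

def encoder_sizes(size=256, depth=4):
    """Spatial sizes along a hypothetical encoder path with one
    stride-2 'same'-padded convolution per level."""
    sizes = [size]
    for _ in range(depth):
        sizes.append(same_conv_out(sizes[-1], 2))
    return sizes
```

For a 256-pixel input and four downsampling levels this yields 256, 128, 64, 32, 16.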

2.4 Network Training

The neural network was trained by minimizing the binary cross-entropy as objective function. Backpropagation was used to compute the gradients of the cross-entropy loss. The model was initialized with random values sampled from a uniform distribution without scaling variance (uniform scaling) as proposed by Glorot and Bengio glorot_understanding_2010. Adaptive Moment Estimation (Adam) kingma_adam_2014 was chosen as the stochastic optimization method. The initial learning rate was gradually reduced during training.

The training set is assembled from 3,519 2D images (2,894 Hannover Medical School and 625 Data Science Bowl Cardiac Challenge images). Data augmentation has been applied to artificially inflate the training set as described in Section 2.5. One complete training run of -net takes about 24 to 36 hours.
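For illustration, the binary cross-entropy objective and the gradient that backpropagation sends through a sigmoid output layer can be sketched in plain numpy; this is a toy stand-in for the TensorFlow training step, not the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(p, y, eps=1e-12):
    """Mean binary cross-entropy between predicted foreground
    probabilities p and the binary ground-truth mask y."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def bce_grad_logits(z, y):
    """Gradient of the mean BCE w.r.t. the pre-sigmoid logits z.

    For a sigmoid output the gradient reduces to (p - y) / n, which is
    the quantity backpropagation passes into the rest of the network."""
    p = sigmoid(z)
    return (p - y) / y.size
```

The closed-form gradient can be checked against a finite-difference approximation of the loss.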

2.5 Data Augmentation

Figure 3: An example of image augmentation on the LVSC (SC-HF-I-1) data set. The upper two panels depict a checkerboard pattern on a domain that has been rotated, sheared, scaled, translated, randomly flipped along the axes and locally deformed. The lower two panels apply the same augmentation to an MRI scan with an exemplary segmentation. The augmented version is also used during training in order to artificially inflate the number of samples.

Data augmentation is often used to artificially inflate the training data set simard_best_2003; ciresan_high-performance_2011; ciregan_multi-column_2012. This technique generates similar images and their corresponding segmentations from already existing data by applying local spatial transformations to them. The key idea behind this approach is that a slightly deformed heart should be identified by the neural network in a similar manner. Due to the image augmentation in the training phase, the algorithm provides good segmentation results regardless of the orientation, scale and parity of the input image. Translational equivariance is inherited from the convolutional network architecture, i.e. a transformed heart is mapped pixel-wise on the corresponding transformed segmentation in the spatial domain.

In order to fully utilize the computational resources of modern workstations equipped with multicore CPU and CUDA-enabled GPU, we have developed an auxiliary library that facilitates parallel image augmentation on the CPU while training of the neural network is delegated to the GPU. The time needed for image augmentation can completely be hidden since concurrent augmentation of a batch of input images can be accomplished faster than the actual training step. Hence, augmentation and training are performed efficiently in an interleaved manner: the next batch of images is augmented on the CPU while the current batch is still being trained on the GPU.
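The interleaving described above can be sketched with a background thread and a bounded queue. This toy Python version only illustrates the scheme; the actual library is implemented in C++ with OpenMP, and the consumer here merely stands in for the GPU training step:

```python
import queue
import threading

def prefetching_batches(batches, augment, capacity=2):
    """Generator that augments batches on a background (CPU) thread while
    the consumer is still busy with the previous batch.

    The bounded queue lets augmentation run at most `capacity` batches
    ahead, so CPU augmentation and GPU training proceed in an
    interleaved, pipelined fashion."""
    q = queue.Queue(maxsize=capacity)
    sentinel = object()

    def worker():
        for batch in batches:
            q.put(augment(batch))   # CPU-side augmentation
        q.put(sentinel)             # signal end of the epoch

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item                  # consumed by the training step
```

Usage: `for batch in prefetching_batches(loader, augment_fn): train_step(batch)`.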

The library is implemented in the C++ programming language and the OpenMP extension. It features convenient bindings for the Python programming language which seamlessly interact with the TensorFlow framework. Moreover, we can apply highly non-linear, and computationally expensive local deformations to the input data as well as the ground truth segmentations, due to the aforementioned efficient interleaving. Besides traditional global transformations from the affine group such as scaling, shearing, rotation, mirroring, and translations, we allow for the pixel-wise deformation of the spatial pixel domain

(i, j) ↦ (i + f^(+)(i, j), j + f^(−)(i, j)) (4)

where f^(±) are second-degree multivariate polynomials in the pixel coordinates,

f^(±)(i, j) := a_i^(±)·i + a_j^(±)·j + a_ij^(±)·ij + a_ii^(±)·i² + a_jj^(±)·j²,

and the coefficients a^(±) are sampled from a uniform distribution over a closed interval. A hyper-parameter controls the amount of deformation; in the special case of vanishing coefficients, no deformation is applied. Fractional indices are mapped via bilinear interpolation. Figure 3 shows an exemplary augmentation of an MRI.
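A minimal numpy sketch of such a polynomial deformation with bilinear resampling is given below; the function names and the exact coefficient scaling are illustrative assumptions, not the library's API:

```python
import numpy as np

def random_poly_field(shape, amplitude, rng):
    """Two random second-degree polynomial displacement fields (Eq. 4),
    evaluated on normalized pixel coordinates, with coefficients drawn
    uniformly from [-amplitude, amplitude]."""
    i, j = np.meshgrid(np.linspace(0, 1, shape[0]),
                       np.linspace(0, 1, shape[1]), indexing="ij")
    basis = (i, j, i * j, i * i, j * j)

    def poly():
        coeffs = rng.uniform(-amplitude, amplitude, len(basis))
        return sum(c * b for c, b in zip(coeffs, basis))

    return poly(), poly()   # per-pixel displacement along each axis

def bilinear_sample(img, fi, fj):
    """Sample img at fractional indices (fi, fj) via bilinear
    interpolation, clamping at the border."""
    h, w = img.shape
    fi = np.clip(fi, 0, h - 1); fj = np.clip(fj, 0, w - 1)
    i0 = np.floor(fi).astype(int); j0 = np.floor(fj).astype(int)
    i1 = np.minimum(i0 + 1, h - 1); j1 = np.minimum(j0 + 1, w - 1)
    di, dj = fi - i0, fj - j0
    top = img[i0, j0] * (1 - dj) + img[i0, j1] * dj
    bot = img[i1, j0] * (1 - dj) + img[i1, j1] * dj
    return top * (1 - di) + bot * di

def deform(img, amplitude, rng):
    """Warp img by the random polynomial field; amplitude 0 is the identity.
    The same field would be applied to the ground-truth segmentation."""
    h, w = img.shape
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    fi_off, fj_off = random_poly_field(img.shape, amplitude, rng)
    return bilinear_sample(img, i + fi_off, j + fj_off)
```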

3 Results

Figure 4: Depiction of the resulting prediction of an overfitted model. For this illustration, ν-net was trained solely on the mhh data set. Inference was performed on a kaggle image. From left to right, this figure shows the MRI, the predicted segmentation with the overfitted network, and the predicted segmentation with the final ν-net, trained as described in Section 2.4.

This study explores the possibility of creating a general purpose cardiac image segmentation model, capable of reliably producing high quality segmentations, independent of aspects such as different image acquisition techniques, and diverse MRI protocols. For this purpose, the model was trained on a proprietary data set (mhh) as well as a small subsample of the kaggle training set. The goal was to learn the specific concept of cardiac segmentation from the highly standardized mhh data set as well as abstracting a more general notion for different image morphologies from the heterogeneous kaggle data set and, in turn, prevent overfitting on the characteristics of a specific data set, as depicted in Figure 4. The resulting intraclass correlation coefficients for all evaluated data sets are listed in Table 2.

Figure 5: Correlation plot of the RV and LV endocardial volumes and VM of the Hannover Medical School Data Set. Spearman’s rho (rho), RMSD, ME and MAPE have been calculated overall.

The agreement of the predicted segmentations with the ground truth is high for the mhh data set, with a Spearman's rho of 0.98 overall for the segmented volumes of the left and right endocardium as well as VM, as depicted in Figure 5. This is supported by a dice of about 90 % as depicted in Table 1 as well as a high ICC of 92-99 % for the ESV and EDV of both ventricles. A lower dice of 78 % is obtained for the right VM. Further results, such as dice, overlap, and accuracy, are listed in Table 3. A selection of images of the mhh, LVSC, and RVSC data sets is illustrated in Figure 6.

Figure 6: A selection of images of the mhh, LVSC, and the RVSC data sets. From the left to the right column, the images depict the MRI, the predicted segmentation, the ground truth segmentation, and the difference between ground truth and prediction, where the error is denoted in red. Each case has a legend left of the image with "data set - case id and region". The upper two rows show good agreement between the segmentation predicted by ν-net and the ground truth at the apex and the base. After an extensive search in the ν-net mhh results, no major disagreement between the manual and automated segmentation of the cardiac contours could be found. The lower three rows show cases of disagreement between predicted segmentation and ground truth at the apex and base. Retrospectively, one could argue that the ν-net segmentation provides a more accurate delineation of the epi- and endocardium compared to the ground truth.
method                        V    DSC (epi)   DSC (endo)
Hannover Medical School
proposed                      LV   95±2 %      92±4 %
proposed                      RV   90±4 %      88±6 %
MICCAI 2009 LV Segmentation Challenge
proposed (ad-hoc)             LV   93±3 %      84±7 %
proposed (with retraining)    LV   95±3 %      94±3 %
ngo_combining_2017            LV   93±2 %      88±3 %
avendi_combined_2016          LV   n/a         94±2 %
poudel_recurrent_2016         LV   n/a         90±4 %
tran_fully_2016               LV   96±1 %      92±3 %
queiros_fast_2014             LV   94±2 %      90±5 %
ngo_fully_2014                LV   n/a         88±3 %
hu_hybrid_2013                LV   94±2 %      89±3 %
liu_automatic_2012            LV   94±2 %      88±3 %
huang_image-based_2011        LV   93±2 %      89±4 %
Right Ventricular Segmentation Challenge
proposed (ad-hoc)             RV   86±6 %      85±7 %
avendi_fully_2016             RV   n/a         81±21 %
tran_fully_2016               RV   86±11 %     84±21 %
zuluaga_multi-atlas_2013      RV   80±22 %     76±25 %
wang2012simple                RV   63±35 %     59±34 %
ou_multi-atlas_2012           RV   63±27 %     58±31 %
Table 1: Comparison of the RV and LV (V) segmentation performance of the epicardium (epi) and endocardium (endo) based on the dice. All values are denoted as mean±std.

The predicted segmentation can be used to directly compute the clinical parameters on mhh data by applying Simpson's method winter_evaluating_2008. The same approach cannot be performed on MRI stemming from different sources because of different segmentation philosophies, which result in differing clinical measurements for the same case. To compensate for this variation, a linear regression has been performed to adapt the predicted ESV and EDV of the kaggle data set. This fit has been determined using standard linear regression, mapping the predicted volumes of the training set onto the ground truth scalar volumes whilst omitting the vertical intercept. Neither the model in training nor the linear regressor was fitted to any of the validation cases, in order to eliminate the possibility of leakage. As depicted in Figure 7, the agreement with the ground truth is high, with a Spearman's rho of 0.96, an ICC of 92/94 % (ESV/EDV), and a MAPE of 14 %.
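The no-intercept adjustment can be sketched as follows; the closed-form slope is the least-squares solution of truth ≈ slope · predicted, and the volume values shown are made up for illustration:

```python
import numpy as np

def fit_scale(predicted, truth):
    """Least-squares slope of a no-intercept linear regression
    truth = slope * predicted, in closed form: <p, t> / <p, p>."""
    predicted = np.asarray(predicted, float)
    truth = np.asarray(truth, float)
    return predicted.dot(truth) / predicted.dot(predicted)

# hypothetical volumes with a systematic ~10 % over-segmentation
pred = np.array([110.0, 220.0, 55.0, 165.0])
true = np.array([100.0, 200.0, 50.0, 150.0])
scale = fit_scale(pred, true)   # fitted on the training set only
adjusted = scale * pred         # applied to (held-out) predictions
```

The single scale factor absorbs a systematic, proportional segmentation-style offset while leaving the ranking of the cases untouched.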

The performance of the proposed method would have ranked among the best 20 competitors of the Second Annual Data Science Bowl with a CRPS of 0.0154. Figure 8 depicts the weakest segmentation, based on the CRPS, for additional illustration.

Figure 7: Correlation plot of the kaggle ground truth versus the prediction. A small subset of the Data Science Bowl training images was manually segmented and used in the training process. The resulting classifier was used to predict the left ventricular volumes of the training set, from which the systematic error was estimated via linear regression. The classifier was then used to segment the validation data set, and the results were adjusted with the conversion factor determined by the linear regression. The adjusted results are plotted in the lower subfigure. Spearman's rho (rho), RMSD, ME and MAPE have been calculated overall.

The correlation of the left and right epi- and endocardial volumes of the LVSC and RVSC data sets is high, with a Spearman's rho of 0.95 as depicted in Figures 11 and 12 as well as ICC values of 96/97 % (ESV/EDV) for the left and 92/96 % for the right ventricle. A lower rho of 0.83 is observed for the right epicardium. Dice values for a selection of relevant studies are presented in Table 1. RMSD, ME and MAPE were calculated after adjusting the resulting volumes with a linear regression fitted on the validation set. This was necessary, as the validation sets are too small to perform a reasonable split. A random sample of images and the corresponding segmentations from the LVSC and RVSC data sets is illustrated in Figures 9 and 10.

       MHH           DSBCC   LVSC    RVSC    HUMAN
       RV     LV     LV      LV      RV      RV     LV
EF     0.960  0.983  0.794   0.945   0.867   0.800  0.953
EDV    0.958  0.985  0.935   0.966   0.924   0.892  0.987
ESV    0.920  0.953  0.918   0.962   0.958   0.917  0.992
SV     0.923  0.978  0.898   0.907   0.841   0.814  0.867
VM     0.832  0.948  n/a     0.941   0.825   0.54   0.848
Table 2: Comparison of the ICC for the clinical parameters EF (%), EDV (ml/m²), ESV (ml/m²), SV (ml/m²), and VM (g/m²) between the ground truth and the fully automatically measured volumes. Human inter-observer ICC (denoted as HUMAN) has been determined by Caudron et al. caudron_cardiac_2012. Coefficients surpassing human performance have been highlighted. LVSC and RVSC are reported as ad-hoc performance.

4 Discussion

There are many philosophies on how to perform image segmentation of cardiac MRI. These philosophies have major influence on the resulting biventricular mass and function parameters. For example, the inclusion or exclusion of trabeculations and papillary muscles affect the left and right ventricular mass as well as the endocardial volume sievers_impact_2004; winter_evaluating_2008; vogel-claussen_left_2006. No convention has been universally accepted for analyzing trabeculation and papillary muscle mass schulz-menger_standardized_2013.

Typically, these conventions are enforced on a per-institute basis. This poses a major challenge for automated image segmentation with deep neural networks because of the resulting differences in the clinical measurements. In order to learn the specific segmentation characteristics of a site, a specifically tailored model would have to be trained, which renders the whole process impractical.

However, most of the time, image segmentation is only a means to the end of determining the clinical parameters by measuring the ventricular volumes. These parameters include the VM, ESV as well as EDV, which yield the EF and SV. These established clinical parameters are of great importance in the early detection of cardiac illnesses as well as in treatment monitoring.

The hypothesis of this study is that the clinical parameters can be reliably measured by adapting a pre-trained neural network to a new environment using one of the most basic statistical models, a linear regression. This result is unexpected, since differences in segmentation guidelines are usually of local nature and do not necessarily need to exhibit a linear dependency on the final measurement. In order to substantiate this assumption, ν-net was trained on the extensive mhh data set. ν-net demonstrates state-of-the-art performance, as depicted in the results in Figure 5. Furthermore, ν-net was trained on a small, hand-picked subset of the kaggle training set. This was done with the idea of transfer learning in mind, in order to convey an understanding of different image acquisition methods and varying image morphologies to the neural network.

ν-net was benchmarked against the kaggle validation set. As depicted in Figure 7, it was possible to adapt the results of the classifier for the specific data set by employing a linear regression, fitted on the training set.

The performance of the proposed method would have ranked among the best 20 competitors of the Second Annual Data Science Bowl. A strong correlation between predicted and ground truth volumes was observed, with a Spearman's rho of 0.965 and a MAPE of 14.4 %. Nevertheless, the final predicted volumes of the proposed solution could be fine-tuned using high-level machine learning regressors, as performed by the winning solution. This study, however, explicitly omits sophisticated post-processing and the use of meta data (such as gender, age, height, and scanner geometry) in order to avoid over-fitting to a specific experimental setting, which could reasonably impair the neural network's generalization potential.

The hypothesis is further substantiated by the results of the LVSC and RVSC data sets. As depicted in Figures 11 and 12, the segmentation results demonstrate a strong correlation with the manual segmentation, exhibiting a Spearman's rho of approximately 0.95. Furthermore, the corresponding ICCs imply human-level performance in determining clinical parameters. ν-net surpasses human level in predicting parameters for the right ventricle. Note that volume prediction was performed without training on a single image of these data sets.

ν-net demonstrates state-of-the-art ad-hoc segmentation performance in terms of the dice for the epi- and endocardium of the RVSC and the epicardium of the LVSC data set compared to tran_fully_2016 and avendi_fully_2016 as well as other studies avendi_combined_2016; ngo_combining_2017; poudel_recurrent_2016; queiros_fast_2014; hu_hybrid_2013; liu_automatic_2012; huang_image-based_2011; zuluaga_multi-atlas_2013; wang2012simple; ou_multi-atlas_2012. This is remarkable, as ν-net was not trained on any images of the aforementioned data sets. The slightly weaker ad-hoc dice of 84 % for the endocardium of the LVSC could reflect different segmentation philosophies compared to the original MHH data set, resulting in a systematic error as shown in Figure 11. This hypothesis is supported by correspondingly high ad-hoc ICC values for the ESV, EDV, EF, and SV. Furthermore, additional retraining results in state-of-the-art performance for the endocardial dice.

Regarding the MHH data set, ν-net achieves comparable or higher agreement with the ground truth than two human observers reach with each other on average when measuring biventricular mass and function parameters caudron_cardiac_2012. Furthermore, ν-net accomplishes performance comparable to humans on the LVSC and RVSC data sets. ν-net outperforms a human by a wide margin especially at the task of gauging the right ventricular endocardial volume and ventricular mass. A slightly lower ICC score of the left endocardial volumes on the kaggle data set is most likely due to the multi-center and multi-observer setting, resulting in an inherent heterogeneity of the data set. In order to improve the performance, the results would have to be evaluated for each observer independently.

One limitation of this study is the small size of openly available data sets. The LVSC and RVSC contain 61 cases with freely accessible contours. Furthermore, the aforementioned data sets include the segmentation of the LV or RV exclusively. Additionally, training and validating on a single center data set bears the risk of overfitting as demonstrated in Figure 4. Therefore, a large, fully labeled, multi-center, multi-reader data set would be advantageous. This data set could be used to train and evaluate future models.

5 Conclusion

This study demonstrates the reliability of automatically determining clinical cardiac parameters such as ESV, EDV, EF, SV and VM on 4 data sets. Especially in the RV, the neural network outperformed the human cardiac expert in the presented study, which likely enables more reliable RV mass and function measurements for improved clinical treatment monitoring in the future. Furthermore, it is demonstrated that the aforementioned parameters, resulting from an image segmentation by a pre-trained neural network, can be adjusted by performing a linear regression. This effectively eliminates the associated costs of introducing a neural network for determining clinical cardiac parameters in a new setting.

5.1 Perspectives

COMPETENCY IN MEDICAL KNOWLEDGE: Deep learning can reliably provide high-quality, fully-automated cardiac segmentation for the precise determination of clinically well established biventricular mass and function parameters in a multi-center setting.

TRANSLATIONAL OUTLOOK: The presented neural network is ready to be used on large scale, multi-center, multi-reader data sets for cost- and time-efficient analysis of cardiac mass and function parameters.

6 Acknowledgments

The authors thank all parties involved in the data acquisition process, especially Frank Schröder and Lars Kähler.

Appendix A Figures and Tables

Figure 8: Depiction of the automated image segmentation results for the case with the worst systolic (0.095) and diastolic (0.129) CRPS in the kaggle data set. The upper plot illustrates the measured volumes for the left and right epi- and endocardium against time. The image numbers (#) for end-systole and end-diastole were determined automatically by ν-net. The ground truth for this sample is 127 ml systolic and 291.4 ml diastolic. However, the segmentation suggests an end-systolic volume of 64.5 ml and an end-diastolic volume of 198 ml.
Figure 9: 24 randomly chosen samples of the predicted segmentation of ν-net for the LVSC data set. The title of each sub-figure denotes the case identifier and the randomly selected image slice in brackets.
Figure 10: 24 randomly chosen samples of the predicted segmentation of ν-net for the RVSC data set. The title of each sub-figure denotes the case identifier and the randomly selected image slice in brackets.
Figure 11: Correlation graph of the ad-hoc performance of ν-net for the LVSC data set, depicting the LV endocardial volume and ventricular mass. Spearman's rho (rho), RMSD, ME and MAPE have been calculated after applying the linear fit.
Figure 12: Correlation graph of the ad-hoc performance of ν-net for the RVSC data set, depicting the RV endocardial volume and ventricular mass. Spearman's rho (rho), RMSD, ME and MAPE have been calculated after applying the linear fit.
label            accuracy     specificity  precision    recall       dice         overlap
LV endocardial   99.9±0.0 %   99.9±0.0 %   91.1±6.6 %   92.3±5.1 %   91.5±4.3 %   84.6±6.9 %
LV mass          99.8±0.1 %   99.9±0.0 %   89.6±4.3 %   88.3±4.0 %   88.8±2.9 %   80.1±4.6 %
LV epicardial    99.8±0.1 %   99.9±0.1 %   94.8±3.4 %   94.3±2.7 %   94.5±2.0 %   89.6±3.4 %
RV endocardial   99.9±0.1 %   99.9±0.0 %   89.4±6.2 %   86.7±7.9 %   87.7±5.5 %   78.5±8.1 %
RV mass          99.8±0.1 %   99.9±0.0 %   80.4±8.2 %   72.4±7.1 %   75.8±5.9 %   61.4±7.5 %
RV epicardial    99.8±0.1 %   99.9±0.0 %   92.9±4.1 %   87.2±5.9 %   89.8±3.8 %   81.8±6.0 %
Table 3: Descriptive statistical analysis of the results based on the Hannover Medical School Data Set. All values have been computed at the end-systolic and end-diastolic phase. All values are denoted as mean±std.

Appendix B Bibliography