Segmentation of cardiac structures in magnetic resonance imaging (MRI) has potential uses in many clinical applications. In cardiac magnetic resonance (CMR) imaging in particular, late gadolinium-enhanced (LGE) imaging is useful for visualizing and detecting myocardial infarction (MI). Another common CMR sequence is T2-weighted imaging, which highlights acute injury and ischemic regions. Additionally, balanced steady-state free precession (bSSFP) cine sequences can be used to analyze cardiac motion [1, 2]. Each CMR sequence is typically acquired independently, and the sequences can exhibit significant spatial deformations relative to each other even when they stem from the same patient. Nevertheless, segmentation of different anatomies from LGE could still benefit from the combination with the other two sequences (T2 and bSSFP) and their annotations. An example of the different CMR sequences utilized in this work can be seen in Fig. 4.
LGE enhances infarcted tissues in the myocardium and is therefore an important sequence for the detection and quantification of myocardial infarction. The infarcted myocardial tissue appears with a distinctively brighter intensity than the surrounding healthy regions. In particular, LGE images are important for estimating the extent of the infarct relative to the myocardium. However, manual delineation of the myocardium is time-consuming and error-prone. Therefore, automated and robust methods for segmenting the cardiac anatomy around the left ventricle (LV) are needed to support the analysis of myocardial infarction. Modern semantic segmentation methods utilizing deep learning have significantly improved the performance in various medical imaging applications [3, 4, 5, 6].
At the same time, deep learning methods typically require large amounts of annotated data in order to train sufficiently robust and accurate models, depending on the difficulty of the task. However, in many use cases, the availability of such annotated cases may be limited for a specific targeted image modality or sequence. For CMR applications containing multiple sequences, annotations for the same anatomy of interest might be available for sequences of the same patient other than the target one. In this work, we attempt the segmentation of cardiac structures in LGE CMR images utilizing classical methods from multi-atlas label fusion in order to provide “noisy” pseudo labels to be used for training deep convolutional neural network segmentation models.
Our method can be described in two steps. In the first step, we register a small set (e.g. five) of LGE CMR images with ground truth labels (“atlases”) to a set of target LGE CMR images without annotations. Each ground truth atlas provides manually annotated labels of the myocardium and the left and right ventricle cavities. After multi-atlas label fusion by majority voting, we possess noisy labels for each of the targeted LGE images. A second set of manual labels exists for some of the patients of the targeted LGE CMR images, but these are annotated on different MRI sequences (bSSFP and T2-weighted). Again, we use multi-atlas label fusion with a consistency constraint to further refine our noisy labels if additional annotations in other sequences are available for that patient. In the second step, we train a deep convolutional network for semantic segmentation on the target data while using data augmentation techniques to avoid over-fitting to the noisy labels. After inference and simple post-processing, we arrive at our final label for the targeted LGE CMR images.
2.1 Multi-Atlas Label Fusion of CMR
Many methods of multi-atlas label fusion exist. In this work, we use a well-established non-rigid registration framework based on a B-spline deformation model, using an existing implementation. The registration is driven by a similarity measurement based on intensities from LGE, T2, and bSSFP images. We perform two sets of registrations:
1) Inter-patient and intra-modality registration, i.e. the registration of LGE with annotations to the targeted LGE images of different patients.
2) Intra-patient and inter-modality registration, i.e. the registration of bSSFP/T2 with annotations to the targeted LGE images of the same patient.
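In both cases, an initial affine registration is performed, followed by non-rigid registration between the source image $I_S$ (providing the annotation, i.e. the “atlas”) and the targeted reference image $I_T$. A coarse-to-fine registration scheme is used in order to first capture large deformations between the images, followed by more detailed refinements. The deformation is modeled with a 3D cubic B-spline model using a lattice of control points $\{\vec{\mu}_{ijk}\}$ and spacings between the control points of $\delta_x$, $\delta_y$, and $\delta_z$ along the $x$-, $y$-, and $z$-axis of the image, respectively. Hence, the deformation of a voxel $\vec{x} = (x, y, z)$ to the domain $\Omega$ of the target image can be formulated as

$$\mathbf{T}(\vec{x}) = \sum_{i,j,k} \beta^3\!\left(\frac{x}{\delta_x} - i\right) \beta^3\!\left(\frac{y}{\delta_y} - j\right) \beta^3\!\left(\frac{z}{\delta_z} - k\right) \vec{\mu}_{ijk}. \tag{1}$$

Here, $\beta^3$ represents the cubic B-Spline function. By maximizing an overall objective function

$$\mathcal{O}\left(I_S, I_T; \mathbf{T}\right) = \mathcal{S} - \lambda_{BE}\,\mathcal{P}_{BE} - \lambda_{IC}\,\mathcal{P}_{IC}, \tag{2}$$

we can find the optimal deformation field between source and targeted images. Here, the similarity measure $\mathcal{S}$ is constrained by two penalties $\mathcal{P}_{BE}$ and $\mathcal{P}_{IC}$ which aim to enforce physically plausible deformations. The contribution of each penalty term can be controlled with the weights $\lambda_{BE}$ and $\lambda_{IC}$, respectively. We use normalized mutual information (NMI), which is commonly used in inter-modality registrations, as our driving similarity measure

$$\mathcal{S} = \mathrm{NMI} = \frac{H(I_T) + H(I_S(\mathbf{T}))}{H(I_T, I_S(\mathbf{T}))}. \tag{3}$$

Here, $H(I_T)$ and $H(I_S(\mathbf{T}))$ are the two marginal entropies, and $H(I_T, I_S(\mathbf{T}))$ is the joint entropy. A Parzen window (PW) approach is utilized to fill the joint histogram, which is necessary in order to compute the NMI between the images efficiently. To encourage realistic deformations, we utilize bending energy, which controls the “smoothness” of the deformation field across the image domain $\Omega$ containing $N$ voxels:

$$\mathcal{P}_{BE} = \frac{1}{N} \sum_{\vec{x} \in \Omega} \left( \left|\frac{\partial^2 \mathbf{T}}{\partial x^2}\right|^2 + \left|\frac{\partial^2 \mathbf{T}}{\partial y^2}\right|^2 + \left|\frac{\partial^2 \mathbf{T}}{\partial z^2}\right|^2 + 2\left[ \left|\frac{\partial^2 \mathbf{T}}{\partial x \partial y}\right|^2 + \left|\frac{\partial^2 \mathbf{T}}{\partial y \partial z}\right|^2 + \left|\frac{\partial^2 \mathbf{T}}{\partial x \partial z}\right|^2 \right] \right). \tag{4}$$

In an ideal registration, the optimized transformations from $I_S$ to $I_T$ (forward) and from $I_T$ to $I_S$ (backward) are the inverse of each other, i.e. $\mathbf{T}_F = \mathbf{T}_B^{-1}$ and $\mathbf{T}_B = \mathbf{T}_F^{-1}$. The implementation used here employs compositions of $\mathbf{T}_F$ and $\mathbf{T}_B$ in order to include a penalty term that encourages inverse consistency of both transformations:

$$\mathcal{P}_{IC} = \sum_{\vec{x} \in \Omega} \left\| \mathbf{T}_F\!\left(\mathbf{T}_B(\vec{x})\right) - \vec{x} \right\|^2 + \sum_{\vec{x} \in \Omega} \left\| \mathbf{T}_B\!\left(\mathbf{T}_F(\vec{x})\right) - \vec{x} \right\|^2. \tag{5}$$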
At each level of the registration, both the image and control point grid resolutions are doubled compared to the previous level. We find suitable registration parameters for both type 1) and type 2) registrations by visual inspection of the transformed images and ground truth atlases. For type 1) registrations, multiple atlases are available to be registered with each target image. We perform a simple majority voting in order to generate our “noisy” segmentation label for each target image.
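The majority voting step described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in this work: the function name `majority_vote` and the tie-breaking rule (the lowest class index wins) are assumptions, since the text does not specify how ties are resolved.

```python
import numpy as np


def majority_vote(atlas_labels, num_classes):
    """Fuse warped atlas label maps by per-voxel majority voting.

    atlas_labels: sequence of integer arrays of identical shape, one per
    registered atlas. Ties go to the lowest class index (an assumption).
    """
    stacked = np.stack(atlas_labels, axis=0)  # (n_atlases, ...)
    # Count the votes for every class: shape (num_classes, ...).
    votes = np.stack(
        [(stacked == c).sum(axis=0) for c in range(num_classes)], axis=0
    )
    # argmax returns the first maximum, i.e. the lowest class index on ties.
    return votes.argmax(axis=0).astype(stacked.dtype)


# Toy example: three "atlases" voting on four voxels.
atlases = [
    np.array([0, 1, 2, 2]),
    np.array([0, 1, 1, 2]),
    np.array([1, 1, 2, 0]),
]
fused = majority_vote(atlases, num_classes=3)
```

In practice the inputs would be the atlas label maps after they have been warped into the space of the target LGE image.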
2.2 Label Consistency with Same Patient Atlases
Because of anatomical consistency between different sequences of the same patient, we employ inter-modality registration to obtain noisy labels for LGE images in type 2) registrations. Two sets of segmentations can be obtained from the registrations: bSSFP to LGE, and T2 to LGE. In order to make sure our noisy labels are accurate enough, we only employ the consistent region where both segmentations agree. In the non-consistent regions, we still use the noisy label from type 1) registrations. In type 1) registrations, we use symmetric registration with bending energy factor and inconsistency factor . We use five resolution levels and the maximal number of iterations per level is . The final grid spacing is five voxels along each of the $x$-, $y$-, and $z$-axes. In type 2) registrations, we use six levels and the maximal number of iterations per level is . The final grid spacing is one voxel along each of the $x$-, $y$-, and $z$-axes.
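The consistency rule described above can be illustrated with a short NumPy sketch. The function and argument names are hypothetical, and one interpretation is assumed: every voxel where the two same-patient propagations agree (including agreement on background) overrides the type 1) label, while all other voxels keep the type 1) label.

```python
import numpy as np


def refine_with_consistency(noisy_lge, from_bssfp, from_t2):
    """Refine type 1) labels using two same-patient label propagations.

    noisy_lge:  label map from inter-patient (type 1) label fusion.
    from_bssfp: labels propagated from the bSSFP annotation (type 2).
    from_t2:    labels propagated from the T2 annotation (type 2).
    """
    noisy_lge = np.asarray(noisy_lge)
    from_bssfp = np.asarray(from_bssfp)
    from_t2 = np.asarray(from_t2)
    # Consistent region: both intra-patient propagations give the same label.
    consistent = from_bssfp == from_t2
    # Use the agreed label there, fall back to the type 1) label elsewhere.
    return np.where(consistent, from_bssfp, noisy_lge)


refined = refine_with_consistency(
    noisy_lge=[1, 1, 2, 0],
    from_bssfp=[1, 2, 2, 3],
    from_t2=[1, 3, 2, 3],
)
```

Only the second voxel is inconsistent in this toy input, so it is the only one that keeps its type 1) label.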
2.3 Deep Learning based Segmentation with Noisy Labels
In the second step, we train different deep convolutional networks for semantic segmentation on the target data while using data augmentation techniques (rotation, scaling, adding noise, etc.) to avoid over-fitting to the noisy labels.
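As an illustration of the kind of augmentation mentioned above, the following sketch applies a random in-plane rotation, an in-plane scaling, and additive Gaussian noise to an image and its label map. It is not the augmentation pipeline used in this work: the function name, the parameter ranges, and the choice of the first two array axes as the in-plane axes are all assumptions. Labels are resampled with nearest-neighbour interpolation so that no new label values are created.

```python
import numpy as np
from scipy import ndimage


def random_augment(image, label, rng, max_angle_deg=15.0, max_scale=0.1,
                   noise_std=0.05):
    """Randomly rotate/scale a 3D image and its label map, then add noise."""
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    cos, sin = np.cos(angle), np.sin(angle)
    # affine_transform maps output coordinates to input coordinates, so the
    # matrix holds the rotation divided by the scale (scale > 1 zooms in).
    matrix = np.array([[cos, -sin, 0.0],
                       [sin, cos, 0.0],
                       [0.0, 0.0, 1.0]])
    matrix[:2, :2] /= scale
    # Choose the offset so that the volume centre stays fixed.
    centre = (np.array(image.shape) - 1) / 2.0
    offset = centre - matrix @ centre
    warped_image = ndimage.affine_transform(image, matrix, offset=offset,
                                            order=1)
    # order=0 is nearest-neighbour interpolation, which keeps labels discrete.
    warped_label = ndimage.affine_transform(label, matrix, offset=offset,
                                            order=0)
    warped_image = warped_image + rng.normal(0.0, noise_std, size=image.shape)
    return warped_image, warped_label


rng = np.random.default_rng(0)
image = rng.random((16, 16, 4))
label = np.zeros((16, 16, 4), dtype=np.int64)
label[4:12, 4:12, :] = 1
aug_image, aug_label = random_augment(image, label, rng)
```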
Given all pairs of images $X$ and pseudo labels $\tilde{Y}$, we re-sample them to 1 mm isotropic resolution and train an ensemble of fully convolutional neural networks to segment the given foreground classes, with $P(X)$ standing for the softmax output probability maps for the different classes in the image. Our network architectures are an encoder-decoder network named AH-Net and a network based on the popular 3D U-Net architecture with residual connections, named SegResNet. For training and implementing these neural networks, we used the NVIDIA Clara Train SDK (https://devblogs.nvidia.com/annotate-adapt-model-medical-imaging-clara-train-sdk) and an NVIDIA Tesla V100 GPU with 16 GB memory. We initialize AH-Net from ImageNet pretrained weights using a ResNet-18 encoder branch, utilizing anisotropic kernels in the encoder path in order to make use of pretrained weights from 2D computer vision tasks. While the initial weights are learned from 2D, all convolutions are still applied in a full 3D fashion throughout the network, allowing it to efficiently learn 3D features from the image. In order to encourage view differences in our ensemble models, we initialize the weights in all three major 3D image planes, i.e. the axial, sagittal, and coronal planes of the images. This approach results in three distinct AH-Net models to be used in our ensemble. The Dice loss has been established as the objective function of choice for medical image segmentation tasks. Its properties make it suitable for the unbalanced class labels common in 3D medical images:

$$\mathcal{L}_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i^2 + \sum_{i=1}^{N} \hat{y}_i^2}. \tag{6}$$

Here, $\hat{y}_i$ is the predicted probability from our network and $y_i$ is the label from our “noisy” label map at voxel $i$. For simplicity, we show the Dice loss for one foreground class in Eq. 6. In practice, we minimize the average Dice loss across the different foreground classes. After inference and simple post-processing, we arrive at our final label set for the targeted LGE CMR images. We resize the ensemble models’ prediction maps to the original image resolution using trilinear interpolation and fuse the probability maps using a median operator in order to reduce outliers. Then, the label index is assigned using the argmax operator:

$$\hat{Y}(\vec{x}) = \underset{c}{\arg\max}\ \underset{m}{\mathrm{median}}\ P_m(c \mid \vec{x}), \tag{7}$$

where $P_m(c \mid \vec{x})$ is the resized probability of class $c$ at voxel $\vec{x}$ predicted by the $m$-th model of the ensemble.
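The median fusion followed by the argmax assignment can be sketched in a few lines of NumPy. The array layout (models first, classes second) and the function name are assumptions made for this illustration; the resizing to the original resolution is omitted.

```python
import numpy as np


def fuse_ensemble(prob_maps):
    """Fuse softmax outputs of several models into one label map.

    prob_maps: array of shape (n_models, n_classes, ...) holding the
    per-class probabilities of every ensemble member.
    """
    probs = np.asarray(prob_maps, dtype=np.float64)
    # The median over the model axis is robust to a single outlier model.
    fused = np.median(probs, axis=0)  # (n_classes, ...)
    # The label index is the class with the highest fused probability.
    return fused.argmax(axis=0)


# Three models, two classes, two voxels; the third model is an outlier.
prob_maps = np.array([
    [[0.10, 0.80], [0.90, 0.20]],
    [[0.20, 0.70], [0.80, 0.30]],
    [[0.90, 0.05], [0.10, 0.95]],
])
labels = fuse_ensemble(prob_maps)
```

In the toy input, the two agreeing models determine the result even though the third model predicts the opposite class with high confidence.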
Finally, we apply 3D largest connected component analysis on the foreground in order to remove isolated outliers.
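A minimal sketch of such a largest connected component filter, using `scipy.ndimage.label`, is given below. It assumes that all foreground classes are treated jointly as one binary mask and that the default face connectivity of `ndimage.label` is acceptable; neither detail is stated in the text.

```python
import numpy as np
from scipy import ndimage


def keep_largest_component(labels):
    """Zero out every foreground voxel outside the largest 3D component."""
    labels = np.asarray(labels)
    foreground = labels > 0
    # Default structuring element: face-connected neighbours only.
    components, num_components = ndimage.label(foreground)
    if num_components == 0:
        return labels.copy()
    sizes = np.bincount(components.ravel())
    sizes[0] = 0  # index 0 is the background, never select it
    largest = sizes.argmax()
    return np.where(components == largest, labels, 0)


volume = np.array([1, 2, 2, 0, 0, 3, 0]).reshape(1, 1, 7)
cleaned = keep_largest_component(volume)
```

The isolated voxel with label 3 is removed, while the multi-class structure of the largest component is preserved.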
3 Experiments & Results
3.1 Challenge Data
The challenge organizers provided the anonymized imaging data of 45 patients with cardiomyopathy who underwent CMR imaging at the Shanghai Renji hospital, China, with institutional ethics approval. For each patient, three CMR sequences (LGE, T2, and bSSFP) are provided as multi-slice images in the ventricular short-axis views acquired at breath-hold. Slice-by-slice manual annotations of the right and left ventricles and the ventricular myocardium have been generated as a gold standard using ITK-SNAP (http://www.itksnap.org) for training of the models and for evaluation of the segmentation results. The manual segmentation took about 20 minutes per case, as stated by the challenge organizers. We also use ITK-SNAP for all the visualizations shown in this paper. For more details, see the challenge website (https://zmiclab.github.io/mscmrseg19/data.html). The available training and test data have the following characteristics:
5 patients:
- LGE CMR (image + manual label) for validation
- T2-weighted CMR (image + manual label)
- bSSFP CMR (image + manual label)
30 patients:
- T2-weighted CMR (image + manual label)
- bSSFP CMR (image + manual label)
10 patients:
- T2-weighted CMR (only image)
- bSSFP CMR (only image)
The 40 patients of the two preceding groups (test set):
- LGE CMR (only image)
As one can see, only five ground truth annotations are available in the targeted LGE images. However, 30 images have gold standard annotations available in different image modalities, i.e. bSSFP and T2. We use all available annotations for type 1) and type 2) multi-atlas label fusion approaches described in Section 2. After “noisy” label generation for all testing LGE images, we train our deep neural network ensemble to produce the final prediction labels for 40 LGE images in the test set. The five manually annotated LGE cases are used as the validation set during deep neural network training in order to find the best model parameters and avoid overfitting completely to the noisy labels. Throughout the challenge, the authors are blinded to the ground truth of the test set during model development and evaluation. Our evaluation scores on the test set are summarized in Table 1. A comparison of the available ground truth annotation in a validation LGE dataset and our model’s prediction is shown in Fig. 10.
| Metric | LV Cavity | LV Myocardium | RV Cavity | Average |
|---|---|---|---|---|
| Surface distance [mm] | 2.13 | 2.32 | 2.80 | 2.41 |
| Hausdorff distance [mm] | 11.6 | 16.3 | 18.1 | 15.3 |
4 Discussion & Conclusion
In this work, we combined classical methods of multi-atlas label fusion with deep learning. We utilized the ability of multi-atlas label fusion to generate labels for new images using only a small set of labeled images of the targeted image modality as atlases, although this results in less accurate (or “noisy”) labels when compared to manual segmentation. Furthermore, we enhanced the noisy labels by merging additional atlas-based label fusion results if annotations of the same patient’s anatomy are available in different image modalities. Here, they came from different MRI sequences, but they could potentially stem from even more different modalities like CT, using multi-modality similarity measures to drive the registrations. After training a round of deep convolutional neural networks on the “noisy” labels, we can see a clear visual improvement over the multi-atlas label fusion results. This indicates that neural networks can still learn correlations between the data and the desired labels even when the training labels are not as accurate as ground truth supervision labels. The networks are able to compensate for some of the non-systematic errors in the “noisy” labels and hence improve the overall segmentation. Because we are blinded to the test set ground truth annotations, we cannot quantify these improvements, but they are visually noticeable, as shown in Fig. 16. In conclusion, we achieved the automatic segmentation of cardiac structures in LGE magnetic resonance images by combining classical methods from multi-atlas label fusion and modern deep learning-based segmentation, resulting in visually compelling segmentation results.
-  Zhuang, X.: Multivariate mixture model for myocardial segmentation combining multi-source images. IEEE transactions on pattern analysis and machine intelligence (2018)
-  Zhuang, X.: Multivariate mixture model for cardiac segmentation from multi-sequence mri. In: MICCAI, Springer (2016) 581–588
-  Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3d u-net: learning dense volumetric segmentation from sparse annotation. In: MICCAI, Springer (2016) 424–432
-  Milletari, F., Navab, N., Ahmadi, S.A.: V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3D Vision (3DV), 2016 Fourth International Conference on, IEEE (2016) 565–571
-  Myronenko, A.: 3d mri brain tumor segmentation using autoencoder regularization. In: International MICCAI Brainlesion Workshop, Springer (2018) 311–320
-  Zhu, W., Huang, Y., Zeng, L., Chen, X., Liu, Y., Qian, Z., Du, N., Fan, W., Xie, X.: Anatomynet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy. Medical physics 46(2) (2019) 576–589
-  Iglesias, J.E., Sabuncu, M.R.: Multi-atlas segmentation of biomedical images: a survey. Medical image analysis 24(1) (2015) 205–219
-  Rueckert, D., Sonoda, L., Hayes, C., Hill, D., Leach, M., Hawkes, D.: Nonrigid registration using free-form deformations: Application to breast mr images. IEEE Trans. Med. Imaging 18(8) (1999) 712–721
-  Modat, M., Ridgway, G., Taylor, Z., Lehmann, M., Barnes, J., Hawkes, D., Fox, N., Ourselin, S.: Fast free-form deformation using graphics processing units. Comput. Meth. Prog. Bio. 98(3) (2010) 278–284
-  Studholme, C., Hill, D.L., Hawkes, D.J.: An overlap invariant entropy measure of 3d medical image alignment. Pattern recognition 32(1) (1999) 71–86
-  Iglesias, J.E., Sabuncu, M.R.: Multi-atlas segmentation of biomedical images: A survey. Medical Image Analysis 24(1) (2015) 205 – 219
-  Mattes, D., Haynor, D.R., Vesselle, H., Lewellen, T.K., Eubank, W.: Pet-ct image registration in the chest using free-form deformations. IEEE transactions on medical imaging 22(1) (2003) 120–128
-  Feng, W., Reeves, S., Denney, T., Lloyd, S., Dell’Italia, L., Gupta, H.: A new consistent image registration formulation with a b-spline deformation model. In: ISBI. (2009) 979–982
-  Modat, M., Cardoso, M., Daga, P., Cash, D., Fox, N., Ourselin, S.: Inverse-consistent symmetric free form deformation. Biomedical Image Registration 7359 (2012) 79–88
-  Liu, S., Xu, D., Zhou, S.K., Pauly, O., Grbic, S., Mertelmeier, T., Wicklein, J., Jerebko, A., Cai, W., Comaniciu, D.: 3D anisotropic hybrid network: Transferring convolutional features from 2D images to 3D anisotropic volumes. In: MICCAI, Springer (2018) 851–858
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. (2016) 770–778
-  Heller, N., Dean, J., Papanikolopoulos, N.: Imperfect segmentation labels: How much do they matter? In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis. Springer (2018) 112–120