End-to-End Diagnosis and Segmentation Learning from Cardiac Magnetic Resonance Imaging

10/23/2018 ∙ by Gerard Snaauw, et al.

Cardiac magnetic resonance (CMR) is used extensively in the diagnosis and management of cardiovascular disease. Deep learning methods have proven to deliver segmentation results comparable to human experts in CMR imaging, but there have been no convincing results for the problem of end-to-end segmentation and diagnosis from CMR. This is in part due to a lack of sufficiently large datasets required to train robust diagnosis models. In this paper, we propose a learning method to train diagnosis models, where our approach is designed to work with relatively small datasets. In particular, the optimisation loss is based on multi-task learning that jointly trains for the tasks of segmentation and diagnosis classification. We hypothesize that segmentation has a regularizing effect on the learning of features relevant for diagnosis. Using the 100 training and 50 testing samples available from the Automated Cardiac Diagnosis Challenge (ACDC) dataset, which has a balanced distribution of 5 cardiac diagnoses, we observe a reduction of the classification error from 32% to 22% compared to a baseline without segmentation. To the best of our knowledge, this is the best diagnosis result from CMR using an end-to-end diagnosis and segmentation learning method.




1 Introduction

Cardiovascular disease (CVD) is consistently ranked the leading cause of death worldwide, killing more people in 2016 than the next four causes together [1]. Cardiovascular Magnetic Resonance (CMR) imaging has proven to be of great value in CVD diagnosis and management. A combination of factors such as the lack of ionizing radiation, excellent soft tissue contrast, and high reproducibility have made it the preferred imaging modality for the quantification of ventricular volumes, myocardial function, and scarring visualization [2, 3]. Increasing clinical use has also resulted in an increased application of CMR in large cohort studies [4]. This proliferation of medical imaging datasets will increase the need for automated tools, making machine learning for imaging data a very promising field.

Current machine learning based methods for automated cardiac diagnosis focus on the detection and segmentation of the heart, followed by the extraction of handcrafted features that are then used for diagnosis [5]. This approach is reflected in the 2017 Automated Cardiac Diagnosis Challenge (ACDC), where the aim is to automatically perform segmentation and diagnosis on a 4D cine-CMR scan. All but one participant in the segmentation part of the challenge used deep learning, where architectures such as the U-net and dilated convolutional networks were explored – the best deep learning approaches scored on par with clinical experts. Interestingly enough, none of the participants in the diagnosis part of the challenge used deep learning. Instead, they performed classification using support vector machines (SVM) and random forests (RF) on handcrafted features extracted from segmentation maps.


The design and implementation of handcrafted features have numerous disadvantages [6]: sub-optimality for the classification task, the requirement of a manual re-design process for new tasks, the need for in-depth knowledge of the task to design relevant features, etc. One of the major motivations for the development of deep learning models is exactly the automatic design of features that are learned to solve particular classification tasks – this mitigates all the negative points listed above. In fact, deep learning has consistently shown state-of-the-art segmentation and classification results [7]. However, to our knowledge, there have been no convincing attempts at end-to-end learning for segmentation and diagnosis classification in cardiology. One possible explanation for this is the lack of large datasets available for this task [6].

In this paper, we propose a multi-task learning process that combines cardiac segmentation and diagnosis classification using the ACDC dataset, which has a balanced distribution of 5 cardiac diagnoses. This multi-task learning guides the automatic design of features relevant for both tasks and serves two purposes: 1) regularization of the cardiac diagnosis training process, and 2) reduction of convergence time. In addition, we evaluate an exponential version of the linear Dice loss to overcome its current limitations in segmenting objects of different sizes, as inspired by Wong et al. [8]. Results show a reduction of the classification error from 32% to 22%, and faster convergence compared to a baseline without segmentation. To the best of our knowledge, this is the best result of an end-to-end trained segmentation and classification method for diagnosing from CMR.

2 Materials and Methods

2.1 Dataset

The ACDC dataset consists of training and testing sets with 100 and 50 4D cine-CMR scans, respectively. Both sets contain a balanced distribution of the following five classes: {Normal (NOR), dilated cardiomyopathy (DCM), hypertrophic cardiomyopathy (HCM), prior myocardial infarction (MINF), abnormal right ventricle (ARV)}. These sets also contain manual segmentations of the {left ventricular cavity (LV), right ventricular cavity (RV), LV myocardium (Myo), background (BG)} at the end-systolic (ES) and end-diastolic (ED) phases.

Scans are resampled to a common in-plane resolution; then they are center cropped and normalized to zero mean and unit standard deviation. Center cropping is performed around the heart bounding box, which is extracted from the segmentation maps for the training set and defined manually for the test set. Normalization is performed slice by slice since cine scans are acquired in this manner. The volumes from the ED and ES phases are combined to form a triple-channel input (ED, S, ES) for the network, where S = ED − ES represents the subtraction volume of the two phases, explicitly incorporating temporal information.
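A minimal NumPy sketch of the preprocessing steps above (cropping, per-slice normalization, and three-channel stacking); the crop size, bounding-box handling, and function names are our own assumptions, and the resampling step is omitted:

```python
import numpy as np

def preprocess_phase(vol, crop=128, box_center=None):
    """Center-crop around the heart bounding box and normalize slice by slice."""
    z, y, x = vol.shape
    cy, cx = box_center if box_center is not None else (y // 2, x // 2)
    y0, x0 = max(cy - crop // 2, 0), max(cx - crop // 2, 0)
    vol = vol[:, y0:y0 + crop, x0:x0 + crop].astype(np.float32)
    # normalize each slice independently, since cine slices are acquired one at a time
    for i in range(vol.shape[0]):
        s = vol[i]
        vol[i] = (s - s.mean()) / (s.std() + 1e-7)
    return vol

def build_input(ed, es, **kw):
    """Stack (ED, ED - ES, ES) into a three-channel network input."""
    ed, es = preprocess_phase(ed, **kw), preprocess_phase(es, **kw)
    return np.stack([ed, ed - es, es], axis=0)
```

The middle channel carries the subtraction volume, which is how the temporal information between phases enters the network.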

2.2 Network Architecture

The DenseNet [9] and U-net [10] models are combined to solve the classification and segmentation tasks. Three distinct branches are identified in the network (Fig. 1): the main (MB), segmentation (SB), and diagnosis (DB) branches. The composite function for every operation in the model consists of Operation-BN-ReLU, with BN denoting batch normalization and ReLU rectified linear units. MB applies a 7x7x7 convolution with stride 2 in the in-slice direction of the input to generate the initial 64 feature maps. Thereafter, the model follows a DenseBlock-bottleneck-size-manipulation structure. DenseBlocks consist of 3x3x3 convolution layers with a fixed growth rate and an increasing number of layers per block as feature map sizes get smaller. Bottleneck layers apply 1x1x1 convolutions to halve the number of feature maps. Size manipulation layers halve or double feature map sizes depending on the branch: MB applies average pooling to produce features shared by both tasks, DB applies max pooling to learn features for classification, and SB applies transpose convolutions to produce segmentation masks at the original input size. MB and SB manipulate feature map sizes in the in-plane direction, while DB manipulates all three axes.
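The channel bookkeeping of the DenseBlock-bottleneck pattern can be sketched as follows; the growth rate k=16 and the per-block layer counts are hypothetical, since the text does not give the exact values:

```python
def dense_block_channels(c_in, n_layers, growth_rate):
    """Channels after a DenseBlock: each 3x3x3 layer concatenates growth_rate new maps."""
    return c_in + n_layers * growth_rate

def bottleneck_channels(c):
    """A 1x1x1 bottleneck convolution halves the number of feature maps."""
    return c // 2

# hypothetical walk down the main branch, assuming growth rate k=16
c = 64  # initial maps from the 7x7x7 stem convolution
for n_layers in (4, 8, 12):  # illustrative: deeper blocks as maps get smaller
    c = bottleneck_channels(dense_block_channels(c, n_layers, 16))
```

This shows why the bottlenecks are needed: without them, dense concatenation would grow the channel count linearly with depth.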

Figure 1: The network architecture consists of three branches. The shared and diagnosis branches form a DenseNet structure, while the shared and segmentation branches form a U-net-like structure. Six consecutive slices of both phases and their subtraction volume are combined into a three-channel input (ED, ED-ES, ES) to include phase information.

2.3 Loss functions

The network is trained using a loss function consisting of a convex combination of a diagnosis classification loss and a segmentation loss:

ℓ = α ℓ_D + (1 − α) ℓ_S,    (1)

where the diagnosis loss ℓ_D is evaluated using the standard cross-entropy loss, and the segmentation loss is represented by:

ℓ_S = (1/N) Σ_{i=1}^{N} (1 − Dice_i)^γ,    (2)

with N denoting the number of segmentation labels in the dataset and γ being a parameter that controls the shape of the loss function. If γ = 1, ℓ_S becomes the linear binary Dice loss [11]. This linear Dice loss is known to produce less accurate segmentation results in unbalanced datasets, i.e., those containing objects of different sizes. To overcome this limitation, we evaluate the loss with different values of γ.
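A minimal NumPy sketch of this combined loss, assuming the convex-combination weight is α and the per-label term is (1 − Dice)^γ as in (1) and (2); the helper names, per-label channel layout, and one-hot targets are our own assumptions:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Soft Dice between a predicted probability map and a binary target mask."""
    inter = np.sum(pred * target)
    return (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

def segmentation_loss(probs, targets, gamma=1.0):
    """Mean over labels of (1 - Dice)^gamma; gamma=1 recovers the linear Dice loss."""
    losses = [(1.0 - dice_coefficient(probs[c], targets[c])) ** gamma
              for c in range(probs.shape[0])]
    return float(np.mean(losses))

def total_loss(diag_probs, diag_label, seg_probs, seg_targets,
               alpha=0.5, gamma=1.0):
    """Convex combination of cross-entropy diagnosis loss and segmentation loss."""
    ce = -float(np.log(diag_probs[diag_label] + 1e-7))
    return alpha * ce + (1.0 - alpha) * segmentation_loss(seg_probs, seg_targets, gamma)
```

Setting alpha=1 removes the segmentation term entirely, which corresponds to the diagnosis-only baseline.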

Figure 2: Results on the validation set for classification and segmentation as a function of α in (1) and γ in (2). Top left: diagnostic error. Bottom left: iteration at which the lowest classification error occurs. Right: DSC (top) and Hausdorff distance (bottom) for the LV at ED.

For γ > 1, the magnitude of the gradient of the loss in (2) increases for low Dice scores and decreases for high Dice scores compared to γ = 1. This behavior emphasizes learning from cases with low Dice scores. For γ < 1 this emphasis is reversed: low Dice scores are associated with low gradient magnitudes, while high Dice scores induce large gradient magnitudes. The emphasis on high-performing cases focuses the training process on the hard-to-learn details of near-perfect results. Both strategies could potentially increase segmentation performance. In [8], an exponential logarithmic loss is evaluated that combines both strategies, but in this paper we study each strategy independently.
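This emphasis reversal follows directly from differentiating the per-label term, assuming it has the form (1 − D)^γ; a small numeric check:

```python
def grad_magnitude(dice, gamma):
    """|d/dD (1 - D)^gamma| = gamma * (1 - D)^(gamma - 1)."""
    return gamma * (1.0 - dice) ** (gamma - 1.0)

low, high = 0.2, 0.9  # a poorly and a well segmented case

# gamma > 1: poorly segmented (low-Dice) cases dominate the gradient
assert grad_magnitude(low, 2.0) > grad_magnitude(high, 2.0)

# gamma < 1: near-perfect (high-Dice) cases dominate instead
assert grad_magnitude(low, 0.5) < grad_magnitude(high, 0.5)

# gamma = 1: the linear Dice loss treats both equally
assert grad_magnitude(low, 1.0) == grad_magnitude(high, 1.0) == 1.0
```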

2.4 Training and Testing Strategies

The training set is split into disease-balanced subsets for training and validation. During each training iteration, we randomly sample six consecutive slices from the volume to be used as the input, where the center of the six slices is randomly selected. We rely on this strategy because the dataset volumes have between 6 and 18 slices, and normalizing this resolution (i.e., interpolating all volumes to 6 slices) would introduce artifacts that could negatively impact the training and inference processes. Additionally, this random selection of six consecutive slices improved training convergence and generalization. For inference, we use the center six slices of the test volume as input. The Adam solver is used for training, and dropout is applied to the input layer and to every convolution.
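The slice-sampling strategies for training and inference can be sketched as follows (the function names are our own):

```python
import numpy as np

def sample_slices(volume, n=6, rng=np.random):
    """Training input: n consecutive slices whose window position is random."""
    depth = volume.shape[0]
    start = rng.randint(0, depth - n + 1)  # only windows that fit the volume
    return volume[start:start + n]

def center_slices(volume, n=6):
    """Inference input: the n central slices of the volume."""
    start = (volume.shape[0] - n) // 2
    return volume[start:start + n]
```

Because volumes have between 6 and 18 slices, every volume admits at least one valid window, and no through-plane interpolation is required.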

3 Results and Discussion

Figure 3: Segmentation results on the validation set – Top row shows Dice similarity coefficients and Hausdorff distances as a function of α in (1) for all three segmentation anatomies and both phases. Bottom row shows the ground truth (left) vs. the segmentation results (right) for one setting of α and γ. From left to right, the cases are: NOR (correctly diagnosed), MINF (correctly classified even with an artifact at the top left), and ARV (misclassified as NOR). All show the LV in blue, RV in red, and Myo in green.
Model                    Accuracy
Baseline (α = 1)         68%
Multi-task (higher α)    70%
Multi-task (lower α)     78%
Wolterink et al.         86%
Isensee et al.           92%
Cetin et al.             92%
Khened et al.            96%*
Table 1: Diagnostic accuracy on the test set – ours vs. challenge [5]. *Improved accuracy to 100% after the challenge.

In this section, the classification results rely on the rate of diagnostic error [5], while the segmentation results rely on the Dice similarity coefficient (DSC) and the Hausdorff distance [5]. We first study the influence of α in (1) and γ in (2). Results in Fig. 2 show that a low value of α corresponds to a lower diagnosis classification error, faster convergence, and better (i.e., higher) Dice scores. In fact, the best result for α = 1 (i.e., an optimization that consists of only the diagnosis loss) is reached at iteration 680 (not shown in the figure), while for lower values of α the best result is reached faster. The Hausdorff distance is the only metric where performance seems to decrease with a lower value of α; however, this could be due to the sensitivity of this metric to outliers rather than an actual influence of α. All three values of γ seem to produce a similar diagnostic error, although they differ in convergence speed and in the quality of the DSC and Hausdorff distance results. Contrary to the other values of γ, results indicate minimal or no influence of the diagnosis loss on the segmentation results for γ < 1. This suggests that the large gradient magnitude of the segmentation loss for well-performing cases overwhelms the influence of the classification training, making segmentation performance independent of the diagnosis training for lower values of γ. For γ > 1, on the other hand, the low gradient magnitude at high DSC causes the model to focus on the diagnosis loss, reducing segmentation accuracy as α increases. The robust diagnosis and segmentation results achieved with γ < 1 make this model best suited for our multi-task training approach.
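The two segmentation metrics used above can be sketched in NumPy (the Hausdorff distance is computed here on contour point sets; function names are our own):

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def hausdorff(pts_a, pts_b):
    """Symmetric Hausdorff distance between point sets of shape (N, d) and (M, d)."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    # worst-case nearest-neighbor distance, taken in both directions
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

The max-of-min structure is what makes the Hausdorff distance so sensitive to single outlier points, as noted in the discussion above.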

Fig. 3 shows consistent accuracy for the segmentation of all anatomies ({LV, RV, Myo} at {ED, ES}) over a large range of values of α. In terms of DSC, Myo is the hardest anatomy to segment, while regarding the Hausdorff distance, the RV appears challenging because of the sensitivity of this distance measure to outliers. These results are somewhat worse than the best ones in the ACDC challenge [5]. However, no direct comparison can be made with the ACDC challenge results [5], as we did not obtain results on the test set because the inference approach described in Sec. 2.4 assesses only the six central slices per volume.

For diagnosis classification, we evaluate three values of α on the test set of the ACDC challenge [5] (with γ fixed in Eq. (2)). Table 1 shows an increase in accuracy as α decreases, which is consistent with the observations in Fig. 2 on the validation set. The difference in accuracy between the two multi-task settings can in part be explained by their difference in segmentation performance. Of the fifteen misclassifications made by the weaker multi-task model, seven involve the ARV class. This coincides with the sharp increase of the RV Hausdorff distance observed in Fig. 3. For the best model, the DSC and Hausdorff distances score well, and no clear observations can be made about the origin of the misclassifications without looking at the images. As the test server provides no information on individual cases, we perform further evaluation on the validation set.

The bottom row of Fig. 3 shows segmentation results along with their ground truth for the best-performing setting of α and γ, where three out of twenty-five cases are misdiagnosed (we show two correctly and one incorrectly diagnosed case). For the incorrectly diagnosed case in Fig. 3 (rightmost image, ARV misclassified as NOR), the under-segmented RV shown is representative of the entire ES phase, while for the ED phase the RV is correctly segmented. Given that the ARV diagnosis relies on the RV ejection fraction, such a segmentation mistake would suggest adequate myocardial contraction and explain why this case is classified as normal. This mistake provides evidence that the features used for training the classification parameters may be strongly correlated with segmentation accuracy. The other two misclassifications involve scans with imaging artifacts near the heart, similar to the middle image in the bottom row of Fig. 3. Interestingly, the only four cases in the validation set that contain imaging artifacts have a softmax probability (i.e., classification confidence) of around 0.7, while all other scans have a probability near 1. Segmentation performance is unaffected by such artifacts, and two of these four cases are still correctly classified, but this shows that classification performance suffers on scans with imaging artifacts.

Comparing our results to the ACDC challenge results (Table 1), which used handcrafted features for diagnosis, we see that the accuracy of our model needs to improve by 8% to 18% to become competitive. This was expected, as the handcrafted features used by the state-of-the-art methods are the same as those used in clinical diagnosis. However, our model shows promising results and could reach similar performance when larger datasets become available. Furthermore, the ACDC challenge organizers excluded ambiguous cases whose handcrafted features contained diagnostic boundary values. This design choice provided large margins within which classifiers using handcrafted features could place their decision boundaries. It is likely that adding these clinical boundary cases would have a significant impact on the performance of such methods.

4 Conclusion

In this paper, we show the first competitive end-to-end diagnosis and segmentation training from CMR imaging. We show that multi-task training can converge faster and reduce the diagnostic error from 32% to 22% compared to a baseline method trained without segmentation. To the best of our knowledge, this is the best result of an end-to-end segmentation and classification method for diagnosing from CMR. Nevertheless, our results need to be improved further before they become competitive with state-of-the-art methods that rely on handcrafted features. We believe that this is simply a matter of increasing the dataset used for training, so we plan to focus on this issue as our future research activity.


  • [1] World Health Organization (WHO), “Mortality database,” www.who.int/healthinfo/mortality_data/en/, 2017.
  • [2] M. Salerno et al., “Recent advances in cardiovascular magnetic resonance: techniques and applications,” Circulation: Cardiovascular Imaging, vol. 10, no. 6, pp. e003951, 2017.
  • [3] P. Peng et al., “A review of heart chamber segmentation for structural and functional analysis using cardiac magnetic resonance imaging,” Magn Reson Mater Phy, vol. 29, no. 2, pp. 155–195, 2016.
  • [4] P. Medrano-Gracia et al., “Challenges of cardiac image analysis in large-scale population-based studies,” Current cardiology reports, vol. 17, no. 3, pp. 9, 2015.
  • [5] O. Bernard et al., “Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: Is the problem solved?,” IEEE Trans. Med. Imag., 2018.
  • [6] G. Litjens et al., “A survey on deep learning in medical image analysis,” MedIA, vol. 42, pp. 60–88, 2017.
  • [7] K. He et al., “Mask r-cnn,” IEEE PAMI, 2018.
  • [8] K.C.L. Wong et al., “3d segmentation with exponential logarithmic loss for highly unbalanced object sizes,” in MICCAI. 2018, pp. 612–619, Springer.
  • [9] G. Huang et al., “Densely connected convolutional networks,” in CVPR, 2017, vol. 1, p. 3.
  • [10] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI. 2015, pp. 234–241, Springer.
  • [11] F. Milletari, N. Navab, and S.A. Ahmadi, “V-net: Fully convolutional neural networks for volumetric medical image segmentation,” in 3D Vision (3DV). 2016, pp. 565–571, IEEE.