Log In Sign Up

Explainable Shape Analysis through Deep Hierarchical Generative Models: Application to Cardiac Remodeling

Quantification of anatomical shape changes still relies on scalar global indexes which are largely insensitive to regional or asymmetric modifications. Accurate assessment of pathology-driven anatomical remodeling is a crucial step for the diagnosis and treatment of heart conditions. Deep learning approaches have recently achieved wide success in the analysis of medical images, but they lack interpretability in the feature extraction and decision processes. In this work, we propose a new interpretable deep learning model for shape analysis. In particular, we exploit deep generative networks to model a population of anatomical segmentations through a hierarchy of conditional latent variables. At the highest level of this hierarchy, a two-dimensional latent space is simultaneously optimised to discriminate distinct clinical conditions, enabling the direct visualisation of the classification space. Moreover, the anatomical variability encoded by this discriminative latent space can be visualised in the segmentation space thanks to the generative properties of the model, making the classification task transparent. This approach yielded high accuracy in the categorisation of healthy and remodelled hearts when tested on unseen segmentations from our own multi-centre dataset as well as in an external validation set. More importantly, it enabled the visualisation in three-dimensions of the most discriminative anatomical features between the two conditions. The proposed approach scales effectively to large populations, facilitating high-throughput analysis of normal anatomy and pathology in large-scale studies of volumetric imaging.


page 1

page 5

page 6

page 7

page 8

page 12

page 13


Learning Interpretable Anatomical Features Through Deep Generative Models: Application to Cardiac Remodeling

Alterations in the geometry and function of the heart define well-establ...

A Generative Shape Compositional Framework: Towards Representative Populations of Virtual Heart Chimaeras

Generating virtual populations of anatomy that capture sufficient variab...

Learning Population-level Shape Statistics and Anatomy Segmentation From Images: A Joint Deep Learning Model

Statistical shape modeling is an essential tool for the quantitative ana...

Cardiac Segmentation with Strong Anatomical Guarantees

Convolutional neural networks (CNN) have had unprecedented success in me...

Factorised spatial representation learning: application in semi-supervised myocardial segmentation

The success and generalisation of deep learning algorithms heavily depen...

Global and Local Interpretability for Cardiac MRI Classification

Deep learning methods for classifying medical images have demonstrated i...

On the Evaluation and Validation of Off-the-shelf Statistical Shape Modeling Tools: A Clinical Application

Statistical shape modeling (SSM) has proven useful in many areas of biol...

Code Repositories


Explainable Shape Analysis through Deep Generative Models

view repo


Explainable Shape Analysis through Deep Generative Models

view repo

I Introduction

The quantification of anatomical changes and their relationship with disease is a fundamental task in medical image analysis, ultimately leading to new clinical insights and enhanced risk assessment and treatment. Recent improvements in the medical image analysis field have been characterised by an increase of large-scale population-based initiatives [1], [2] together with development of automated segmentation pipelines of anatomical structures, which recently achieved human-level performance [3]. In this context, the development of novel data-driven processing tools to enable quantitative assessment of the differences between normal anatomy and pathology has now received significant interest [4], for instance in the study of cardiac remodelling where standard volumetric and mass analysis proved to be sub-optimal [5].

Cardiac remodelling is a clinical term referring to any change in the heart shape and structure and it is a strong predictor of the clinical course of heart failure [5], [6]. The gold-standard imaging technique to assess cardiac remodelling is cardiovascular magnetic resonance (MR) which enables the imaging of the heart at high-resolution and in three-dimensions (3D) [7]. However, classification and risk-stratification of cardiac-disease patients still rely on scalar indexes describing cardiac remodeling (e.g. left ventricular (LV) mass or ejection fraction) which neglect regional or asymmetric effects that occur during pathology, especially at early stages [5], [6]. This proved to be critical in conditions such as hypertrophic cardiomyopathy (HCM), which is defined by the presence of LV hypertrophy that cannot be solely explained by abnormal loading conditions [8], and can manifest in complex remodeling patterns not readily quantifiable using volumetric indices [9].

In the last decade, computational extensions of conventional volumetric analysis have been introduced to study the 3D and regional nature of anatomical shape changes and to learn better shape descriptors [1], [10], [11], [12]. In the literature, the analysis of large amounts of 3D anatomical shapes has been typically addressed through statistical shape models [13]

, where principal component analysis (PCA) or similar dimensionality reduction techniques are used to analyse populations of structures of interest in terms of a mean template shape and its main modes of shape variation. These modes are then employed in the discrimination of distinct groups of subjects by means of their shape differences or to identify the shape features mostly associated with pathology. These approaches have been successful in many applications, including the visualisation of differences in heart geometry between preterm born individuals and controls

[14], the quantification of the main components of global shape variation in asymptomatic adults [15], [16]

, the discovery of main LV shape variation modes associated with myocardial infarction by means of logistic regression

[17] and the assessment of their association with conventional cardiovascular risk factors [18]. However, despite the popularity of PCA-based approaches, previous work has shown that LV PCA shape components do not have clear clinical interpretation beyond the first two modes [17]. Moreover, the main modes of shape variation do not necessarily encode the anatomical information needed to differentiate disease classes. For this purpose, approaches that search for alternative axes of variation facilitating this differentiation have been proposed, including shape decomposition into orthogonal modes optimally related to clinical remodeling indices [11], [12].

Machine learning approaches have achieved outstanding results in the medical image analysis domain, such as in the discrimination of physiological versus pathological hypertrophy patterns from multiple manually-derived cardiac indices [19] and of dilated cardiomyopathy patients versus controls [20]. In particular, deep learning methods proved to be powerful features extractors for the classification of cardiovascular diseases [21], [22]. Despite their tremendous success, however, a major drawback is their lack of interpretability, which currently hampers their translation to clinical practice due to the fact that the physiological reason that drives the classification result is often as important as the classification result itself [22].

In this work, we propose a new deep learning approach to learn a hierarchy of conditional latent variables that (1) models a population of anatomical segmentations of interest, (2) enables the classification of distinct clinical conditions by using the highest level of the hierarchy and (3) whose anatomical effect can be visualised and quantified in the original segmentation space. These contributions are achieved by specialising the highest level of a deep hierarchical generative model for the classification of distinct clinical conditions. As a consequence, thanks to the generative properties of the model, distinct segmentations corresponding to different values of the highest level can be generated, making the classification model interpretable. In addition, by constraining the highest level to be two-dimensional, the feature space in which the classification is performed can also be directly visualised. Therefore, our approach consists in an automated data-driven tool which enables the detailed analysis of the anatomical remodelling patterns associated with a large number of clinical conditions.

Ii Related Work

An autoencoder is a non-linear dimensionality reduction technique which learns a compact feature representation of the input data by encoding it into and decoding it from a low-dimensional feature vector. Deep autoencoder-based architectures have achieved wide success in computer vision applications as an extension of PCA-based approaches, including feature learning of 3D objects

[23]. Autoencoder-based models have also been used to learn compact representations of medical images [21]. Relevant to this work, Oktay et al. [24] showed how autoencoder-derived features of LV segmentations can be successfully used to constrain deep networks for medical image analysis tasks, and how these features outperform PCA features in the classification of healthy subjects versus dilated cardiomyopathy and HCM patients.

Deep generative models have demonstrated great performance in learning data distributions over a low-dimensional set of latent variables and in generating new unseen samples, which is not possible with standard autoencoder models. Among this class of models, variational autoencoder (VAE) models [25] learn a continuous latent representation by enforcing it to follow a predefined distribution. VAEs have been successful at learning the latent space representing deforming 3D shapes for a variety of applications, including shape space embedding and generation, outperforming state-of-the-art methods [26], [27]. In the medical imaging domain, VAEs have been exploited to approximate the distribution and likelihood of previously unseen MR images [28], to learn a low-dimensional manifold of 3D fetal skull segmentations [29] and to learn a low-dimensional probabilistic deformation model for cardiac image registration [30].

Shakeri et al. [31]

employed a VAE model to learn a low-dimensional representation of co-registered hippocampus meshes, which was employed in conjuction with a multi-layered perceptron (MLP) to classify healthy subjects from Alzheimer disease patients. The network input consisted of mesh vertices coordinates, and the representation was learned through two fully connected layers. Similarly, in our preliminary work


we modified the 3D convolutional VAE framework in order to learn a low-dimensional latent representation of 3D LV segmentations, which was not only able to encode the 3D segmentations manifold, but also to discriminate different conditions by performing the classification task in the latent space. In the same work, we proposed a latent space navigation method to explore the anatomical variability encoded by the learned latent space. This consisted in iteratively modifying the latent representation of a segmentation obtained from an healthy subject along the direction that maximized its probability to be classified as pathological. By decoding the different latent representations in the original space of the segmentations, our technique allowed the visualisation of the anatomical changes caused by this transformation. However, the proposed approach has limitations. In particular, the learned VAE latent space not only encoded the factors of variation that most discriminate between classes, but also all the factors of variation that regulate shape appearance (hence the need for the aforementioned latent space navigation method). Moreover, an additional offline dimensionality reduction step was needed to visualise the obtained clustering in the VAE latent space. These limitations can be found also in our later work


, where a supervised denoising autoencoder is used to learn a latent code representation of right ventricular contraction patterns and, at the same time, enables survival prediction. Finally, the latent space navigation method proposed in our preliminary work can only obtain subject-specific paths (with no obvious navigation stopping criteria), and not the more clinically-appealing population-based inferences from a database of shapes.

Hierarchical VAEs are a class of generative models that decompose the input data into a hierarchical representation [34], [35]. Although highly flexible, these models have been traditionally difficult to optimise, especially in the training of their higher levels, as often their lowest layer alone can contain enough information to reconstruct the data distribution, and the other levels are ignored. In this work, we focus on the ladder VAE (LVAE) framework [35], which was shown to be capable of learning a deeper and more distributed latent representation by combining the approximate likelihood and the data-driven prior latent distribution at each level of the generative model.


In this paper, we aim to extend our preliminary work [32] on classification and visualisation of discriminative features by employing LVAEs, with the aim of assisting clinicians in quantifying the morphological changes related to disease, and in order to develop medical image classifiers that can visualise the morphological features driving the classification result. The main contributions of this work can be described as follows:

  • We demonstrate that an interpretable classifier of anatomical shapes can be developed by performing a classification task of interest in the highest level of a LVAE model. In this way, the latent variables of this level automatically encode the most discriminative features for the task under exam, while the other subsequent levels model the remaining factors of anatomical variation in the data.

  • We show that the LVAE highest latent space can be assumed to be two-dimensional so that the classification space can be directly visualised without further offline dimensionality reduction steps. Furthermore, we demonstrate how the anatomical variability encoded by this latent space can be visualised in the original space of the segmentations thanks to the generative properties of the model, enabling the visualisation of the anatomical effect of the most discriminative features between different conditions.

  • We demonstrate how the proposed LVAE-based method achieves high classification accuracy of HCM versus healthy 3D LV segmentations and how the model captures and enables the easy visualisation of the most discriminative features between the two conditions. Moreover, we show that the learned hierarchical representations provide higher reconstruction accuracy compared to single-latent-space VAEs.

  • While hierarchical VAEs have been mainly evaluated on benchmark datasets, here we successfully apply them on a real-world 3D medical imaging dataset. We show insights on the model functioning and optimal training, and we make the implementation of proposed method publicly available111
    DOI 10.5281/zenodo.3247898

Iii Methods

This section is organised as follows. First, in subsection A and B, we summarise the theoretical foundations of the proposed method. Second, in subsection C, we describe our modifications to the original VAE and LVAE frameworks towards explainable shape analysis (graphical models in Fig. 1). Then, in subsection D, we describe the datasets used in this work for the classification of healthy subjects vs HCM patients. Finally, in subsection E, we provide a detailed description of the LVAE model used in this work (model summary in Fig. 2).

Fig. 1: Graphical models of a standard VAE (a), of our previously proposed method [32] (b) and the new LVAE-based approach (c). Schematic representation of a three-level LVAE (d) and of the flow of information (e).

Iii-a Variational Autoencoder (VAE)

Given a training set of anatomical segmentations of a structure of interest from a population , a VAE [25] is a probabilistic generative model that aims at learning the distribution of the population of segmentations under study. The distribution is learned from the data by using a model of latent variables , where and is the number of pixels/voxels in a segmentation . The VAE graphical model is depicted in Fig. 1 (a) and the generative model is defined as


where is the prior distribution over the variables and are the learnable parameters of the model. However, directly optimising for the segmentations of the training set is computationally infeasible, as it requires to compute the integral in Eq. 1 over all the values. This issue is addressed by performing variational inference. In variational inference, an inference (or encoder) network is used to approximate , the real distribution of values conditioned on the input segmentations , being the integral of Eq. 1 dominated by a smaller set of

values. In the VAE framework, this inference network is jointly optimised through stochastic gradient descent together with a generative (or decoder) network

through the variational lower bound:


where KL is the Kullback-Leibler divergence. The first term in the lower bound represents the reconstruction loss, i.e. how accurate is the generative model

in the reconstruction of the segmentation from the latent space values ; the second term will represent how well is similar to its prior distribution (please refer to [36] for a full derivation of this lower bound).

Iii-B Ladder Variational Autoencoder (LVAE)

A Ladder VAE (LVAE) [35] is a hierarchical latent variable model that employs a deep hierarchy of conditional latent variables in the generative model and it is schematised in Fig. 1 (d). The total prior distribution of this model is factorised as:


where the highest latent space () has a prior distribution

which is typically assumed to be a Gaussian distribution with

and (Eq. 5), while the other levels in the hierarchy have their prior values of and that conditionally depend on the upper levels of the ladder (Eq. 4).

The LVAE inference model also differs from a standard VAE. In particular, each layer in the hierarchy of the latent variables is conditioned on the previous stochastic layers and the total inference model is specified by the following fully factorised Gaussian distribution:


In contrast with standard hierarchical VAEs [34], where the inference and prior distributions are computed separately with no explicit sharing of information, the LVAE framework introduces a new inference mechanism. As shown in Fig. 1 (e), at each level

, an approximate likelihood estimation

and of its latent Gaussian distribution parameters is obtained from the encoder branch. This likelihood estimation is combined with the prior estimates and obtained from the generative branch to produce a posterior estimation and of the latent Gaussian distribution at that level . In particular, this sharing mechanism between the inference (encoder) and generative (decoder) branches is performed at each level through a precision-weighted combination of the form:


while and . This combination enables to build a data-dependent posterior distribution at each level,

, that is both a function of the values assumed in the higher levels of the generative model and of the inference information derived of the subsequent (lower) levels. The loss function of the LVAE is the same of a VAE (Eq.

2) with the only difference that the number of KL divergence terms is equal to the number of levels in the ladder. These KL divergence terms force the learned prior and posterior distributions at each level to be as close as possible. The sharing of information between the encoder and decoder through Eq. 8 promotes the learning of a data-dependent prior distribution better suited for the dataset to be modelled. Moreover, this provides a better and more stable training procedure as the inference (encoder) branch iteratively corrects the generative distribution, instead of learning the posterior and prior values separately [35].
The full LVAE generative model has therefore the following formulation:

Fig. 2: Detailed scheme of the LVAE+MLP architecture adopted in this work. Top: encoder model; Bottom: decoder model. The green arrows indicate the loss function terms used to train the network.

Iii-C LVAE for Interpretable Shape Analysis

In our previous work [32], we proposed a modification of the standard VAE framework presented in Section III-A to include a classification network able to predict the disease class label associated with a segmentation by using its latent representation (the corresponding graphical model is shown in Fig. 1 (b)). In this work, we hypothesise that such modification can be extended to the LVAE framework by attaching a MLP , which classifies the disease status of an input segmentation , to the highest latent space (graphical model in Fig. 1 (c)). By training the LVAE+MLP architecture end-to-end we aim at obtaining a very low-dimensional latent space which encodes the most discriminative features for the classification task under study, while the other latent spaces will encode all the other factors of variation needed to reconstruct the input segmentations . This has two main advantages: 1) template shapes for each disease class can be obtained by sampling from the learned distributions in a top-down fashion (starting from the highest level in the hierarchy and subsequently from every prior ). The posterior

can be estimated by kernel density estimation and, since

is typically very low-dimensional, this estimation is straightforward; 2) if the latent space is designed to be 2D or 3D, the distributions in the classification space can be directly visualised without the need of further offline dimensionality reduction techniques required in previous works [32], [33].

Iii-D Application to Cardiac Remodelling - Datasets

A multi-centre cohort consisting of 686 HCM patients (56.8 14 years, 27% women, 77% Caucasian, HCM diagnosed using standard clinical criteria) and 679 healthy volunteers (40.7 12.7 years, 56% women, 69% Caucasian) was considered for this work. All the subjects were scanned at 1.5-T on Siemens (Erlangen, Germany) or Philips (Best, Netherlands) system using a standard cardiac MR protocol. LV short-axis cine images were acquired with a balanced steady-state free-precession sequence. The end-diastolic (ED) and end-systolic (ES) phases were automatically segmented using a previously published and extensively validated cardiac multi-atlas segmentation framework [16]. As a first post-processing step, the obtained LV short-axis stack segmentations were upsampled using a multi-atlas label fusion approach. For each segmentation, twenty manually annotated high-resolution atlases at ED and ES were warped to the subject space using free-form non-rigid registration and fused with majority vote, leading to an upsampled high-resolution segmentation [37]

. In a second step, all segmentations were aligned onto the same reference space at ED by means of landmark-based and subsequent intensity-based rigid registration to remove pose variations. After extracting the LV myocardium label, each segmentation was cropped and padded to

dimensions using a bounding box positioned at the centre of the LV ED myocardium. This latter operation guarantees shapes to maintain their alignment after cropping. Finally, all segmentations underwent manual quality control in order to discard scans with strong inter-slice motion or insufficient LV coverage, resulting in 426 HCM patients and 431 healthy volunteers that were used for the final analysis. As an additional external testing dataset, ED and ES segmentations from 20 healthy volunteers and 20 HCMs from the ACDC MICCAI’17 challenge training dataset222 were also used (after undergoing pre-processing using the same high-resolution upsampling pipeline explained above).

Fig. 3: Latent space clusters in the highest latent space () obtained by the proposed LVAE+MLP model on both the in-house training and testing datasets as well as on the ACDC dataset (entirely used as an additional testing dataset). Dimension 1 and 2 represent the two dimensions of . On the left, long-axis sections of the reconstructed 3D segmentations at ED and ES obtained by sampling from three points in are shown.

Iii-E Application to Cardiac Remodelling - LVAE+MLP model details

A detailed scheme of the three-level () LVAE+MLP architecture employed in this work for the classification of HCM patients vs healthy subjects is summarised in Fig. 2. For the sake of display clarity the model scheme has been split into two rows: the encoder (inference) branch is shown at the top while the decoder (generative) branch is depicted at the bottom, and the two branches are connected by the latent space . The input of the encoder branch are the 3D LV segmentations at ED and ES for each subject under study, which are presented as a two-channel input (top-left of Fig. 2

). A 3D convolutional encoder compresses them into a 128-dimensional embedding through a series of 3D convolutional layers with stride 2. This embedding is used then as input of a deterministic inference network, which computes the likelihood estimates

and for each level of the hierarchy of latent variables. These estimates are derived by manipulating the input through a series of fully connected layers (black arrows), which are all followed by batch normalisation and elu non-linearity with the only exception of the layers computing and . At the highest latent space ( in this case), a shallow MLP (2 layers) is attached to learn , i.e. to predict the class (HCM or healthy) label corresponding to the input segmentation by just using its latent variable values . These latent variable values are sampled during training from where and and they are also the starting point of the generative process (bottom-right of Fig. 2). At each level of the generative (decoder) network, the prior distribution terms are computed by modifying the values of the previous latent space

through a fully connected layer followed by batch normalization and

elu non-linearity and by a second fully connected layer. These prior values are combined with and through Eq. 8 to obtain the posterior estimates and from which is sampled. Finally, the value of is passed to a 3D convolutional decoder which aims to reconstruct the input segmentations through a series of upsampling and convolutional layers.

The training loss function of the LVAE+MLP network is composed of three contributions: 1) two LV segmentation reconstruction accuracy terms at ED and ES as the overlap (Dice score) between the input segmentation and its reconstruction ; 2) KL divergence terms, penalising discrepancies between the learned prior and posterior distributions at each level and 3) a binary classification cross entropy (CE) term for the classification of healthy vs HCM segmentations. Differently from the standard VAE loss, in our work Dice score is used as a segmentation reconstruction loss. Although this does not exactly correspond to the variational lower-bound of Eq. 2, in our preliminary results this term was found to behave more robustly. All the divergence terms except the one of the highest level () were evaluated between the prior distribution and their poster distribution , while for the highest level the prior distribution was assumed to be a standard Gaussian . The total loss function is


and depends on , and parameters which weight the different terms during training. At testing, a pair of ED and ES LV segmentations are reconstructed by starting from and by assigning to and the values and computed from and , i.e. no sampling is performed from the posterior distribution at each level. To interpret the anatomical information encoded by the highest latent space, at each level , the value of can be assigned to instead of and the segmentations are reconstructed as explained above. In this way, by varying the values of , a set of segmentations at ED and ES can be directly generated for each point in , without using the inference information provided by and . This enables the visualisation of the anatomical information encoded by the highest latent space. Finally, in order to visualise the distribution of a set of segmentations under exam in the highest latent space, the

values of each segmentation can be computed through the inference network and directly plotted in a 2D space. A Tensorflow implementation on the proposed model has been made publicly available on GitHub

DOI 10.5281/zenodo.3247898

Fig. 4: Average healthy and HCM shapes at ED and ES sampled from the two clusters in the highest latent space of proposed LVAE+MLP model. The colormap encodes the vertex-wise wall thickness (WT), measured in mm.
Fig. 5: Point-wise difference in wall thickness (dWT) at ED and ES between the healthy and the HCM average shapes of Fig. 4. Left - lateral wall; Right - septal wall.

Iv Results

Model Training

Our in-house dataset of segmentations from healthy and HCM subjects was randomly divided into train, validation and test sets consisting of a total of 537 (276 from healthy volunteers, 261 from HCMs), 150 (75 from healthy volunteers, 75 from HCMs) and 200 (100 from healthy volunteers, 100 from HCMs) segmentations. We adopted a 3-level LVAE+MLP model (Fig. 2) since adding more levels did neither improve the reconstruction accuracy nor the classification accuracy in the clinical application under exam. The model was trained on a NVIDIA Tesla K80 GPU using Adam optimiser with learning rate equal to and batch size of 16. For the first 40k iterations, data augmentation including rotations around the three standard axis with rotation angles randomly extracted from a Gaussian distribution was applied in order to take into account small mis-registrations. This helped the final model to achieve higher reconstruction accuracy, as it can be seen in the tables reported in the Supplementary Data 1. In the loss function (Eq. 10), the KL weights were fixed to , and while was set to increase from 0 to 100 by steps of 0.5 every 4k iterations. The relative magnitude and ascending order of the KL weights were chosen as they provided the best segmentation reconstruction results (i.e. higher Dice Score). In particular, our experiments showed that an ascending order of the weights improves both classification and reconstruction accuracy in contrast with models having all the weights equal or in descending order (results are shown in Supplementary Data 2). This suggests the higher levels of a LVAE might be more difficult to train, and that a lower KL regularization term helps the training. The model produced similar results when varying these parameters within one order of magnitude, while a further increase in value reduced reconstruction accuracy and a further decrease resulted in model overfitting. The classification loss function weight was instead set to 0.005: we observed that a higher value would have still produced a good model, but at the price of a more unstable training at the early stages. Additionally, the increase of the classification and KL divergence weights during training through the parameter, known as deterministic warm-up [35], proved to be crucial to construct an expressive generative model (see in-depth analysis in Supplementary Data 1). After 220k iterations the training procedure was stopped as the increase of the KL divergence started to interfere with the decrease of the reconstruction and classification losses. In particular, this is due to the fact that in the highest latent space the KL divergence term tries to cluster all the data together, while the classification loss tries to separate the clusters. Hence the relative weight of and needs to be tuned in order to obtain a good equilibrium.

Classification and Reconstruction Results

All the 200 subjects in our testing dataset were correctly classified (100% sensitivity and specificity) by the trained prediction network. The same model also correctly classified 36 out of the 40 ACDC MICCAI 2017 segmentations (100% sensitivity and 80% specificity) without the need of any re-training procedure. Of note, 3 of the 4 misclassified segmentations suffered from a lack of coverage of the LV apex, which might be the cause for the error. The results obtained for the exemplar clinical application are shown in Fig. 3, where two separated clusters of segmentations have been discovered both on the training and on the testing data. An analogous result was obtained in our previous work [32]: however, the previous version of the model required an additional dimensionality reduction step to visualise in 2D the obtained latent space of segmentations, while in the new proposed framework the highest latent space is 2D by design. Moreover, the new model achieved higher reconstruction accuracy than the previous model, as shown in Table V, suggesting that a better generative model of shapes was learned. In particular, the table shows the reconstruction accuracy in terms of 3D Dice score and average 2D slice-by-slice Hausdorff distance between the 3D original and reconstructed segmentations on the testing and training datasets obtained by the proposed LVAE+MLP model and our previous VAE-based model (VAE+MLP) [32]. The VAE+MLP model was constructed with the same 3D convolutional encoder and decoder networks of the LVAE+MLP model and with a single latent space composed of 98 latent variables, which corresponds to the total number of latent variables adopted in the LVAE+MLP model (three levels of 64, 32 and 2 latent variables, respectively). As it can be noticed in the table, the obtained Dice score results at ES are better than at ED for all the models, while the Hausdorff results follow instead an opposite trend. This is probably due to the fact that since the LV is more compact at ES, the Dice score might not be sensitive to small misalignment of the reconstructed shape, which are instead captured by the Hausdorff distance.

Fig. 6: tSNE visualisation of the latent spaces and .

Visualisation of the latent spaces

Thanks to the properties of the proposed model, the anatomical information encoded by each latent space can be directly visualised, especially the information embedded in the highest level (), which encodes the most discriminative features for the classification of healthy and HCMs 3D LV segmentations. For the exemplar application under investigation, little intra-cluster variability between the shapes generated from the latent space was obtained, while much larger inter-cluster variability between the generated shapes was obtained. This can be seen on the left-side of Fig. 3 where we report a long-axis section of the 3D reconstructed segmentations at ED and ES at three points of the latent space (for a grid visualisation of the shapes encoded by this latent space please refer to Supplementary Data 3). Moreover, in Fig. 4 we show the obtained mean average shape for each cluster, represented as a triangular mesh with point-wise wall thickness (WT) values at vertex. This was obtained by sampling segmentations from each cluster in after estimating its probability density via kernel density estimation. Then, the obtained segmentations for each cluster were averaged to extract the corresponding average segmentation. Finally, a non-rigid transformation between the obtained average segmentation and a 3D high-resolution LV segmentation from the UK Digital Heart project444 was computed, and the inverse of this transformation was applied to the corresponding 3D high-resolution LV segmentation to warp it to the cluster specific average segmentation. At each of the mesh vertices, WT was then computed as the perpendicular distance between the endocardial and epicardial wall. The results are presented in Fig. 4, where it can be noticed that the average HCM shape has higher WT than the corresponding healthy shape and it has a slightly reduced size. Fig. 5 instead reports the point-wise difference in WT between the HCM and the healthy shape, and it can be noticed that the most discriminative anatomical feature to classify an HCM shape consists in an increased WT in the septum, which is in agreement with the clinical literature [38]. Fig. 7 shows a long-axis section of the reconstructed segmentations at ED and ES from the LVAE+MLP model when only posterior information is used (first column) or when also the posterior information in the other levels ( is exploited: the latent spaces and evidently encode different anatomical features that help to refine the structural information provided by . Results for more subjects are reported in Supplementary Data 4. Finally, we applied the dimensionality reduction technique tSNE [39] to visualise in two dimensions the distributions of and latent spaces and we have found that the latent representations of the two classes of shapes are clustered also at both these levels (plots shown in Fig. 6). A possible explanation relies on the fact that the generative process is a conditional: if the data is clustered at the top of the hierarchy, it may be easier for the network to keep the clusters also in the subsequent levels.

VAE+MLP vs LVAE+MLP Reconstruction Accuracy
[mm] [mm]
VAE+MLP train
LVAE+MLP train
VAE+MLP test

Dice score (DSC) and average 2D slice-by-slice Hausdorff distance (H) at ED and ES and they standard error of the mean for the proposed LVAE+MLP model and for the VAE+MLP model proposed in

[32] on training and testing datasets.
Fig. 7: Long-axis section of reconstructed segmentations at ED and ES by the LVAE+MLP model, using only information(first column) or also using the posterior information of the other latent spaces . Last column: ground-truth (GT) segmentation. DSC = Dice Score between the segmentation at that column and the GT.

V Discussion

In this work, we presented a data-driven framework which learns to model a population of 3D anatomical segmentations through a hierarchy of conditional latent variables, encoding at the highest level of the hierarchy the most discriminative features to differentiate distinct clinical conditions. This is achieved by implementing and extending for the first time the LVAE framework to a real-world medical imaging application. In particular, by performing a classification task in the highest level of the LVAE hierarchy of latent variables, we can force this latent space to encode the most discriminative features for a clinical task under exam, while the other levels encode the other factors of anatomical variation needed to model the manifold of segmentations under analysis. Being a generative model, this framework provides the advantage of enabling the visualisation and quantification in the original segmentation space of the anatomical effect encoded by each latent space. Hence, the anatomical differences used by the classifier to distinguish different conditions can be easily visualised and quantified by sampling from the highest level posterior distribution computed from a given database of shapes. Moreover, by designing this latent space to be two or three dimensional, no additional offline dimensionality reduction technique is required to visually assess the distribution of these shapes in the latent space. As a consequence, this method not only provides a deep learning classifier that uses a task-specific latent space in the discrimination of different clinical conditions, but also enables the visualisation of the anatomical features encoded by this latent space, making the classification task transparent.

With the aim of assisting the cardiologists in quantifying the morphological changes related to cardiac conditions, we have applied the proposed framework for the automatic classification of cardiovascular pathologies associated with cardiac remodeling. In particular, we focused on the classification of HCM patients versus healthy volunteers using only a 3D segmentation at both ED and ES phases. In the reported exemplar clinical application, the learned features achieved high accuracy in the discrimination of healthy subjects from HCM patients on our unseen testing dataset and on a second external testing dataset from the ACDC MICCAI 17 challenge. Moreover, the visualisation of the features encoded in the highest level of the adopted LVAE+MLP model showed how the proposed model is able to provide the clinicians with a 3D visualisation of the most discriminative features for the task under study, making the data-driven assessment of regional and asymmetric remodelling patterns characterizing a given clinical condition possible.

We have also showed that the proposed LVAE+MLP model allows the construction of a better generative model in comparison to a VAE-based model with a single latent space [32]. To the best of our knowledge, this result confirms for the first time that hierarchical latent spaces provide a more accurate generative model on a real clinical dataset. Moreover, this work also gives insights on the functioning of these models on 3D anatomical segmentations, including how the different levels of latent variables encode different anatomical features that define the shapes being reconstructed, and how to optimally train this class of models for the reconstruction of these 3D anatomical segmentations.

A limitation of the current approach is the fact that the input segmentations need to be rigidly registered to train the model. Future work should consider how to extend the proposed method to unregistered shapes, for example with the introduction of spatial transformer modules inside the architecture. Similarly to most machine learning-based approaches, another limitation is the need for the manual - and frequently data-set specific - tuning of hyper-parameters, such as the number of adopted levels in the ladder and their weights importance in the model loss function. However, the provided detailed description of the values chosen in our implementation for these hyperparameters could be of great help when applying these model to other datasets.

The prior distribution adopted in the highest latent space is a standard Gaussian distribution

: future work could consider alternative prior distributions which could further favour the clustering of shapes. Even more interestingly, the interpretability and visualisation properties of the proposed method indicate that it could constitute an interesting tool for unsupervised clustering of shapes, for example by learning in the highest level discrete random variables. Finally, while this work showed the potentialities of the method for the classification task of healthy and HCM shapes, the proposed method is domain-agnostic and can be applied to any classification problem where pathological remodelling is a predictor of a disease class label.

Vi Conclusions

In recent years, the medical image analysis field has witnessed a marked increase both in the construction of large-scale population-based imaging databases and in the development of automated segmentation frameworks. As a consequence, the need for novel approaches to process and extract clinically relevant information from the collected data has greatly increased. In this work, we proposed a method for data-driven shape analysis which enables the classification of different groups of clinical conditions through a very low-dimensional set of task-specific features. Moreover, this framework naturally enables the quantification and visualisation of the anatomical effects encoded by these features in the original space of the segmentations, making the classification task transparent. As a consequence, we believe that this method will be a powerful tool for the study of both normal anatomy and pathology in large-scale studies of volumetric imaging.


  • [1] C. G. Fonseca, M. Backhaus, D. A. Bluemke, R. D. Britten, J. D. Chung, B. R. Cowan, I. D. Dinov, J. P. Finn, P. J. Hunter, A. H. Kadish et al., “The Cardiac Atlas Project: an imaging database for computational modeling and statistical atlases of the heart,” Bioinformatics, vol. 27, no. 16, pp. 2288–2295, 2011.
  • [2] R. Attar, M. Pereañez, A. Gooya, X. Albà, L. Zhang, M. H. de Vila, A. M. Lee, N. Aung, E. Lukaschuk, M. M. Sanghvi et al., “Quantitative CMR Population imaging on 20,000 subjects of the UK Biobank imaging study: LV/RV quantification pipeline and its evaluation,” Medical Image Analysis, 2019.
  • [3] W. Bai, M. Sinclair, G. Tarroni, O. Oktay, M. Rajchl, G. Vaillant, A. M. Lee, N. Aung, E. Lukaschuk, M. M. Sanghvi et al., “Automated cardiovascular magnetic resonance image analysis with fully convolutional networks,” Journal of Cardiovascular Magnetic Resonance, vol. 20, no. 1, p. 65, 2018.
  • [4]

    J. L. Bruse, M. A. Zuluaga, A. Khushnood, K. McLeod, H. N. Ntsinjana, T.-Y. Hsia, M. Sermesant, X. Pennec, A. M. Taylor, and S. Schievano, “Detecting clinically meaningful shape clusters in medical image data: metrics analysis for hierarchical clustering applied to healthy and pathological aortic arches,”

    IEEE Transactions on Biomedical Engineering, vol. 64, no. 10, pp. 2373–2383, 2017.
  • [5] F. Triposkiadis, J. Butler, F. M. Abboud, P. W. Armstrong, S. Adamopoulos, J. J. Atherton, J. Backs, J. Bauersachs, D. Burkhoff, R. O. Bonow et al., “The continuous heart failure spectrum: moving beyond an ejection fraction classification,” European Heart Journal, 2019.
  • [6] J. N. Cohn, R. Ferrari, N. Sharpe et al., “Cardiac remodeling - concepts and clinical implications: a consensus paper from an international forum on cardiac remodeling,” Journal of the American College of Cardiology, vol. 35, no. 3, pp. 569–582, 2000.
  • [7] A. F. members, P. M. Elliott, A. Anastasakis, M. A. Borger, M. Borggrefe, F. Cecchi, P. Charron, A. A. Hagege, A. Lafont, G. Limongelli et al., “2014 ESC guidelines on diagnosis and management of hypertrophic cardiomyopathy: the task force for the diagnosis and management of Hypertrophic Cardiomyopathy of the European Society of Cardiology (ESC),” European Heart Journal, vol. 35, no. 39, pp. 2733–2779, 2014.
  • [8] C. W. Yancy, M. Jessup, B. Bozkurt, J. Butler, D. E. Casey, M. H. Drazner, G. C. Fonarow, S. A. Geraci, T. Horwich, J. L. Januzzi et al., “2013 ACCF/AHA guideline for the management of heart failure: a report of the American College of Cardiology Foundation/American Heart Association Task Force on practice guidelines,” Journal of the American College of Cardiology, vol. 62, no. 16, pp. e147–e239, 2013.
  • [9] G. Captur, C. Y. Ho, S. Schlossarek, J. Kerwin, M. Mirabel, R. Wilson, S. Rosmini, C. Obianyo, P. Reant, P. Bassett et al., “The embryological basis of subclinical hypertrophic cardiomyopathy,” Scientific Reports, vol. 6, p. 27714, 2016.
  • [10] A. Lopez-Perez, R. Sebastian, and J. M. Ferrero, “Three-dimensional cardiac computational modelling: methods, features and applications,” Biomedical Engineering Online, vol. 14, no. 1, p. 35, 2015.
  • [11] X. Zhang, P. Medrano-Gracia, B. Ambale-Venkatesh, D. A. Bluemke, B. R. Cowan, J. P. Finn, A. H. Kadish, D. C. Lee, J. A. Lima, A. A. Young et al., “Orthogonal decomposition of left ventricular remodeling in myocardial infarction,” GigaScience, vol. 6, no. 3, p. gix005, 2017.
  • [12] A. Suinesiaputra, P. Ablin, X. Alba, M. Alessandrini, J. Allen, W. Bai, S. Cimen, P. Claes, B. R. Cowan, J. D’hooge et al., “Statistical shape modeling of the left ventricle: myocardial infarct classification challenge,” IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 2, pp. 503–515, 2018.
  • [13] A. A. Young and A. F. Frangi, “Computational cardiac atlases: from patient to population and back,” Experimental Physiology, vol. 94, no. 5, pp. 578–596, 2009.
  • [14] A. J. Lewandowski, D. Augustine, P. Lamata, E. F. Davis, M. Lazdam, J. Francis, K. McCormick, A. R. Wilkinson, A. Singhal, A. Lucas et al., “Preterm heart in adult life: cardiovascular magnetic resonance reveals distinct differences in left ventricular mass, geometry, and function,” Circulation, vol. 127, no. 2, pp. 197–206, 2013.
  • [15] P. Medrano-Gracia, B. R. Cowan, B. Ambale-Venkatesh, D. A. Bluemke, J. Eng, J. P. Finn, C. G. Fonseca, J. A. Lima, A. Suinesiaputra, and A. A. Young, “Left ventricular shape variation in asymptomatic populations: the multi-ethnic study of atherosclerosis,” Journal of Cardiovascular Magnetic Resonance, vol. 16, no. 1, p. 56, 2014.
  • [16] W. Bai, W. Shi, A. de Marvao, T. J. Dawes, D. P. O’Regan, S. A. Cook, and D. Rueckert, “A bi-ventricular cardiac atlas built from 1000+ high resolution MR images of healthy subjects and an analysis of shape and motion,” Medical Image Analysis, vol. 26, no. 1, pp. 133–145, 2015.
  • [17] X. Zhang, B. R. Cowan, D. A. Bluemke, J. P. Finn, C. G. Fonseca, A. H. Kadish, D. C. Lee, J. A. Lima, A. Suinesiaputra, A. A. Young et al., “Atlas-based quantification of cardiac remodeling due to myocardial infarction,” PloS One, vol. 9, no. 10, p. e110243, 2014.
  • [18] K. Gilbert, W. Bai, C. Mauger, P. Medrano-Gracia, A. Suinesiaputra, A. M. Lee, M. M. Sanghvi, N. Aung, S. K. Piechnik, S. Neubauer et al., “Independent left ventricular morphometric atlases show consistent relationships with cardiovascular risk factors: A UK Biobank study,” Scientific Reports, vol. 9, no. 1, p. 1130, 2019.
  • [19] S. Narula, K. Shameer, A. M. S. Omar, J. T. Dudley, and P. P. Sengupta, “Machine-learning algorithms to automate morphological and functional assessments in 2D echocardiography,” Journal of the American College of Cardiology, vol. 68, no. 21, pp. 2287–2295, 2016.
  • [20] E. Puyol-Antón, B. Ruijsink, B. Gerber, M. S. Amzulescu, H. Langet, M. De Craene, J. A. Schnabel, P. Piro, and A. P. King, “Regional multi-view learning for cardiac motion analysis: Application to identification of dilated cardiomyopathy patients,” IEEE Transactions on Biomedical Engineering, vol. 66, no. 4, pp. 956–966, 2019.
  • [21] G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. Van Der Laak, B. Van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
  • [22] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester et al., “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?” IEEE Transactions on Medical Imaging, vol. 37, no. 11, pp. 2514–2525, 2018.
  • [23] Y. Fang, J. Xie, G. Dai, M. Wang, F. Zhu, T. Xu, and E. Wong, “3D deep shape descriptor,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , 2015, pp. 2319–2328.
  • [24] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. A. Cook, A. De Marvao, T. Dawes, D. P. O‘Regan et al.

    , “Anatomically constrained neural networks (ACNNs): application to cardiac image enhancement and segmentation,”

    IEEE Transactions on Medical Imaging, vol. 37, no. 2, pp. 384–395, 2018.
  • [25] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
  • [26] C. Nash and C. K. Williams, “The shape variational autoencoder: A deep generative model of part-segmented 3D objects,” in Computer Graphics Forum, vol. 36, no. 5.   Wiley Online Library, 2017, pp. 1–12.
  • [27] Q. Tan, L. Gao, Y.-K. Lai, and S. Xia, “Variational autoencoders for deforming 3D mesh models,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5841–5850.
  • [28] K. C. Tezcan, C. F. Baumgartner, R. Luechinger, K. P. Pruessmann, and E. Konukoglu, “MR image reconstruction using deep density priors,” IEEE Transactions on Medical Imaging, 2018.
  • [29] J. J. Cerrolaza, Y. Li, C. Biffi, A. Gomez, M. Sinclair, J. Matthew, C. Knight, B. Kainz, and D. Rueckert, “3D Fetal Skull Reconstruction from 2DUS via Deep Conditional Generative Networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2018, pp. 383–391.
  • [30] J. Krebs, H. e Delingette, B. Mailhé, N. Ayache, and T. Mansi, “Learning a probabilistic model for diffeomorphic registration,” IEEE Transactions on Medical Imaging, 2019.
  • [31] M. Shakeri, H. Lombaert, S. Tripathi, S. Kadoury, A. D. N. Initiative et al., “Deep spectral-based shape features for alzheimer’s disease classification,” in International Workshop on Spectral and Shape Analysis in Medical Imaging.   Springer, 2016, pp. 15–24.
  • [32] C. Biffi, O. Oktay, G. Tarroni, W. Bai, A. De Marvao, G. Doumou, M. Rajchl, R. Bedair, S. Prasad, S. Cook et al., “Learning interpretable anatomical features through deep generative models: Application to cardiac remodeling,” in International Conference on Medical Image Computing and Computer-Assisted Intervention.   Springer, 2018, pp. 464–471.
  • [33] G. A. Bello, T. J. Dawes, J. Duan, C. Biffi, A. de Marvao, L. S. Howard, J. S. R. Gibbs, M. R. Wilkins, S. A. Cook, D. Rueckert et al., “Deep-learning cardiac motion analysis for human survival prediction,” Nature Machine Intelligence, vol. 1, no. 2, p. 95, 2019.
  • [34] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backpropagation and approximate inference in deep generative models,” arXiv preprint arXiv:1401.4082, 2014.
  • [35] C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther, “Ladder variational autoencoders,” in Advances in neural information processing systems, 2016, pp. 3738–3746.
  • [36] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.
  • [37] D. Rueckert, L. I. Sonoda, C. Hayes, D. L. Hill, M. O. Leach, and D. J. Hawkes, “Nonrigid registration using free-form deformations: application to breast MR images,” IEEE Transactions on Medical Imaging, vol. 18, no. 8, pp. 712–721, 1999.
  • [38] M. Y. Desai, S. R. Ommen, W. J. McKenna, H. M. Lever, and P. M. Elliott, “Imaging phenotype versus genotype in hypertrophic cardiomyopathy,” Circulation: Cardiovascular Imaging, vol. 4, no. 2, pp. 156–168, 2011.
  • [39] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.

Supplementary Data 1 - Effect of Deterministic Warm-Up (DWU) and Data Augmentation (DA)

Effect of DA and DWU
C [%]
TABLE II: Dice score (DSC) and average 2D slice-by-slice Hausdorff distance (H) at ED and ES and they standard error of the mean for the proposed LVAE+MLP model when Deterministic Warm-Up (DWU) and Data Augmentation (DA) are applied. Results obtained on the training dataset.
Effect of DA and DWU
TABLE III: Dice score (DSC) and average 2D slice-by-slice Hausdorff distance (H) at ED and ES and they standard error of the mean for the proposed LVAE+MLP model when Deterministic Warm-Up (DWU) and Data Augmentation (DA) are applied. Results obtained on the testing dataset.

Supplementary Data 2 - Effect of the KL weights

Effect of the KL weights
TABLE IV: Dice score (DSC), average 2D slice-by-slice Hausdorff distance (H) at ED and ES and they standard error of the mean together with classification accuracy (C) for the proposed LVAE+MLP model for different sets of the KL weights in the training loss function. Results obtained on the training dataset.
Effect of the KL weights
TABLE V: Testing data results. Dice score (DSC), average 2D slice-by-slice Hausdorff distance (H) at ED and ES and they standard error of the mean together with classification accuracy (C) for the proposed LVAE+MLP model for different sets of the KL weights in the training loss function. Results obtained on the testing dataset.

Supplementary Data 3 - Visualisation of

Fig. 8: Long-axis section of reconstructed segmentations at ED by the LVAE+MLP model by sampling from different points in and subsequently from the prior distribution of the latent variables and .
Fig. 9: Long-axis section of reconstructed segmentations at ES by the LVAE+MLP model by sampling from different points in and subsequently from the prior distribution of the latent variables and .

Supplementary Data 4 - Additional Reconstruction Examples

Fig. 10: Long-axis section of an healthy subject reconstructed segmentations at ED and ES by the LVAE+MLP model using only information (first column) or also using the posterior information of the other latent spaces . Last column: ground-truth (GT) segmentation. DSC = Dice Score between the segmentation at that column and the GT.
Fig. 11: Long-axis section of an HCM patient reconstructed segmentations at ED and ES by the LVAE+MLP model using only information (first column) or also using the posterior information of the other latent spaces . Last column: ground-truth (GT) segmentation. DSC = Dice Score between the segmentation at that column and the GT.
Fig. 12: Long-axis section of an HCM patient reconstructed segmentations at ED and ES by the LVAE+MLP model using only information (first column) or also using the posterior information of the other latent spaces . Last column: ground-truth (GT) segmentation. DSC = Dice Score between the segmentation at that column and the GT.