I Introduction
Cardiac magnetic resonance (CMR) is the gold-standard technique for the assessment of cardiac morphology. Conventional practice is to acquire a stack of breath-hold 2D image sequences in the left ventricular (LV) short axis, supplemented by long-axis image sequences in prescribed planes, to enable reproducible volumetric analysis and diagnostic assessment [1]. Disadvantages of this approach for whole-heart segmentation are low through-plane resolution, misalignment between breath-holds and lack of whole-heart coverage. High-resolution 3D image sequences address some of these issues, but have their own disadvantages: long acquisition times, relatively low in-plane resolution and lack of clinical availability. However, high-resolution 3D segmentations have proved crucial for the construction of integrative statistical models of cardiac anatomy and physiology and for disease characterization [2, 3]. For these reasons, a method to reconstruct a 3D high-resolution segmentation from routinely-acquired 2D cines could be highly beneficial, offering high-resolution phenotyping that is robust to artefact in large clinical populations with conventional imaging.
The reconstruction of 3D anatomical structures from a limited number of 2D views has previously been studied via deformable statistical shape models [4]. However, these methods require complex reconstruction procedures and are computationally intensive. In recent years, with the advent of learning-based approaches, and in particular of deep learning, a number of alternative strategies have been proposed. The TL-embedding network (TL-net) consists of a 3D convolutional autoencoder (AE) which learns a vector representation of the 3D geometries, while a second convolutional neural network attached to the latent space of the AE maps 2D views of the same object to the same vector representation [5]. More recently, [6] proposed a convolutional conditional variational autoencoder (CVAE) architecture for the 3D reconstruction of the fetal skull from 2D ultrasound standard planes of the head. Finally, [7] showed how a convolutional variational autoencoder (VAE) can learn a shape model of left ventricular (LV) segmentations, and how the learned latent space can be exploited to accurately discriminate healthy from pathological cases and to generate realistic segmentations unseen during training.

In this work, we present a CVAE architecture that reconstructs a high-resolution 3D segmentation of the LV myocardium from three segmentations of 2D standard cardiac views (one short-axis and two long-axis). Moreover, we show how the proposed model, thanks to its generative properties, naturally produces a confidence map associated with each reconstruction, unlike deterministic models.
II Materials and Methods
II-A 3D Cardiac Image Acquisition and Segmentation
A high-spatial-resolution 3D balanced steady-state free precession cine MR image sequence was acquired from 1,912 healthy volunteers of the UK Digital Heart Project at Imperial College London using a 1.5 T Philips Achieva system (Best, the Netherlands) [3]. The left and right ventricles were imaged in their entirety in a single breath-hold (60 sections, repetition time 3.0 ms, echo time 1.5 ms, flip angle 50°, field of view mm, matrix , reconstructed voxel size mm, 20 cardiac phases, temporal resolution 100 ms, typical breath-hold 20 s). For each subject, a 3D high-resolution segmentation of the LV was obtained automatically using a previously reported technique employing a set of manually annotated atlases [3]. In this work, only the end-diastolic (ED) frame was considered.
II-B Conditional Variational Autoencoder Architecture
The outline of the proposed CVAE architecture is shown in Fig. 1. We aim to reconstruct a 3D high-resolution LV segmentation X from segmentations S obtained in a small number of 2D views. From the training data we aim to learn a conditional generative model P(X|z, c) by means of a d-dimensional latent variable z and a low-dimensional representation c of the views S. In this work we use a single 2D convolutional neural network (CNN) to encode the 2D views into the low-dimensional representation c. An alternative encoding strategy was proposed in [6], using a separate branch for each conditional input of the model. However, whilst this latter approach proved effective when the views suffer from large inconsistencies or variability (e.g., freehand ultrasound scans), we can notably reduce the model complexity by combining the views into a single three-channel input, as these are consistently acquired in clinical routine.
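To make the data flow concrete, a minimal NumPy sketch of a single forward pass is given below. Random linear maps stand in for the trained 2D and 3D convolutional networks, and the grid is downscaled from the paper's 80³ volumes to 16³ to keep the sketch lightweight; only the 125-dimensional latent and view-code sizes follow the values reported later in the text, everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shape-only sketch of the CVAE data flow; random projections stand in for
# the trained convolutional encoders and decoder.
D, side = 125, 16
views = (rng.random((3, side, side)) > 0.5).astype(float)     # three-channel 2D input
seg3d = (rng.random((side, side, side)) > 0.5).astype(float)  # 3D segmentation X

W_view = rng.standard_normal((3 * side * side, D)) * 0.01
c = views.reshape(-1) @ W_view                       # low-dimensional view code c

W_enc = rng.standard_normal((side ** 3 + D, 2 * D)) * 0.01
h = np.concatenate([seg3d.reshape(-1), c]) @ W_enc   # encoder sees both X and c
mu, log_var = h[:D], h[D:]                           # Gaussian posterior parameters

z = mu + np.exp(0.5 * log_var) * rng.standard_normal(D)  # reparameterised sample

W_dec = rng.standard_normal((2 * D, side ** 3)) * 0.01
logits = np.concatenate([z, c]) @ W_dec              # decoder conditioned on c
recon = 1 / (1 + np.exp(-logits.reshape(side, side, side)))  # voxel probabilities

print(c.shape, z.shape, recon.shape)  # (125,) (125,) (16, 16, 16)
```

At test time the 3D encoder branch is dropped and only the view code c and a fixed latent value drive the decoder, which is what makes reconstruction from 2D views alone possible.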
Directly inferring P(X|S) is impractical, as it would require sampling a large number of latent values z. However, variational inference allows us to approximate it by introducing a high-capacity function Q(z|X,S) that yields a distribution over the z values which are likely to produce X. Hence we can learn P(X|S) by maximizing the following objective:

log P(X|S) - D_KL[Q(z|X,S) || P(z|X,S)] = E_{z~Q}[log P(X|z,S)] - D_KL[Q(z|X,S) || P(z|S)]   (1)
where D_KL denotes the Kullback-Leibler (KL) divergence between two distributions (the full mathematical derivation of the equation can be found in [8]). The encoding function Q(z|X,S) can be modelled as a Gaussian distribution parametrized by mean and standard-deviation vectors μ and σ. These two vectors are learned by encoding the input 3D segmentation X via a 3D CNN into a set of features, which are concatenated with the low-dimensional representation c of the views S; a fully connected neural network then maps this concatenation to μ and σ, so that Q(z|X,S) can be learned. If Q is modelled by a sufficiently expressive function, it will match the true posterior P(z|X,S) and the first KL term in (1) will be zero; optimizing the right-hand side of (1) therefore corresponds to optimizing log P(X|S). In this work, the first term on the right-hand side of (1) is computed as the Dice score (DSC) between X and its reconstruction X̂, the output of the generative model. The second term can be computed in closed form if we assume the prior distribution P(z|S) to be N(0, I), a d-dimensional normal distribution with zero mean and unit standard deviation, where d is the number of dimensions of the latent space. Therefore the loss function we minimize becomes

L = -DSC(X, X̂) + β · D_KL[Q(z|X,S) || N(0, I)],

where β weights the KL term.

II-C Experimental Setup and Network Training
In this work, we mimicked the two long-axis and one short-axis views acquired in a routine examination with the following steps: (1) we rigidly aligned all the ground-truth 3D high-resolution segmentations by performing landmark-based and subsequent intensity-based rigid registration; (2) we kept only the LV myocardium label and cropped and padded the segmentations to a [x = 80, y = 80, z = 80, t = 1] dimension using a bounding box centred at the centre of mass of the LV myocardium; (3) we sampled three orthogonal views passing through the centre of each segmentation (an example is shown in Fig. 1). This process extracts three 2D views showing the same three LV sections consistently for all subjects. In the following experiments, the ground-truth 3D high-resolution segmentations and their corresponding 2D views were all kept in the same reference space. Inter-subject pose variability will be addressed in future work, potentially with a simple data-augmentation strategy.
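Step (3) above can be sketched with NumPy; the hollow spherical shell is a toy stand-in for an aligned 80³ LV myocardium segmentation:

```python
import numpy as np

# Toy stand-in for an aligned 80^3 LV myocardium segmentation: a hollow
# spherical shell centred in the volume.
side = 80
idx = np.indices((side, side, side)) - side // 2
radius = np.sqrt((idx ** 2).sum(axis=0))
seg = ((radius > 20) & (radius < 28)).astype(np.uint8)

# Three orthogonal planes through the centre of the volume, mimicking the
# two long-axis views and the one short-axis view.
c = side // 2
views = np.stack([seg[c, :, :],   # long-axis view 1
                  seg[:, c, :],   # long-axis view 2
                  seg[:, :, c]])  # short-axis view

print(views.shape)  # (3, 80, 80)
```

Because every volume is cropped around the LV centre of mass first, the same three slice indices yield anatomically comparable views for all subjects.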
The dimension of the latent space z was fixed to 125, as values smaller than 100 provided less accurate results, while above 125 no further improvements were observed. The dimensionality of the low-dimensional representation c of the views was kept equal to that of z to guarantee a balanced contribution to the generative model. Simulations for different values of the parameter β in the loss function were performed: low values of β provided better reconstruction results on the training data at the expense of a strong deviation from normality of the latent-space distribution (KL term not converging), causing overfitting. Higher values of β penalized the reconstruction term in favour of a strictly normal latent space, hence providing poorer reconstruction accuracy. In this work we chose an intermediate value of β, as this provided good reconstruction accuracy and convergence of the KL term.
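This loss — a Dice reconstruction term plus a KL divergence to a standard normal, weighted by a trade-off parameter written β below — can be sketched numerically. The tensors and the β value here are purely illustrative:

```python
import numpy as np

def dice(gt, pred, eps=1e-7):
    """Soft Dice between a binary volume and a probabilistic reconstruction."""
    inter = (gt * pred).sum()
    return (2 * inter + eps) / (gt.sum() + pred.sum() + eps)

def kl_to_std_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1 - log_var)

rng = np.random.default_rng(0)
gt = (rng.random((80, 80, 80)) > 0.7).astype(float)      # toy ground truth
pred = np.clip(gt + rng.normal(0, 0.1, gt.shape), 0, 1)  # noisy reconstruction

mu = rng.normal(0, 0.1, 125)       # posterior mean (125-D latent, as in the text)
log_var = rng.normal(0, 0.1, 125)  # posterior log-variance

beta = 0.01  # illustrative trade-off weight; the paper's exact value is not given here
loss = (1 - dice(gt, pred)) + beta * kl_to_std_normal(mu, log_var)
print(round(float(loss), 4))
```

Minimizing 1 − DSC is equivalent (up to a constant) to minimizing −DSC, and the closed-form KL term is exactly what a small β lets drift away from zero, which is the overfitting regime described above.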
Experiments were performed with different numbers of views as conditions for the proposed model. In particular, referring to the first long-axis view as 1, the second long-axis view as 2 and the short-axis view as 3, we trained the model using either a single view (CVAE_1), a combination of two views (CVAE_12, CVAE_23, CVAE_13) or all three views (CVAE_123). We also studied the feasibility of training a 2D AE to reconstruct the three views and used its encoder as a pre-trained conditional encoder (pCVAE_123). Moreover, the reconstruction capability of the proposed architecture was compared with that of the TL-net [5]. Finally, we compared the reconstruction obtained by a VAE with z = 0 (VAE_0) to all our test segmentations, as this represents the best segmentation that the generative model can reconstruct when no information is provided to it. Results obtained with an autoencoder (AE) are also reported, since this model yielded better results than VAEs with any of the tested β values, as it optimizes only the reconstruction accuracy. All the models share the same 3D encoder and decoder architectures.
The dataset was split into training, evaluation and testing sets consisting of 1,362, 150 and 400 subjects, respectively. Data augmentation included rotations around the three orthogonal axes, with rotation angles randomly drawn from a normal distribution, and random closing and opening morphological operations. All the networks were implemented in TensorFlow, and training was stopped after 300k iterations, when the total validation loss had stopped improving (approximately 42 hours per network on an NVIDIA Tesla K80 GPU), using stochastic gradient descent (Adam optimizer) with a batch size of 8. During testing, the 3D encoder branch was disabled and the reconstructions were obtained by setting the latent variables z = 0.

Model      | DSC [%]       | Hausd. [mm]  | MassDiff [%]
VAE_0      | 65.48 ± 0.38  | 9.32 ± 0.06  | 35.37 ± 0.70
CVAE_1     | 78.08 ± 0.33  | 5.29 ± 0.04  | 3.94 ± 0.38
CVAE_23    | 82.90 ± 0.21  | 4.43 ± 0.04  | 3.93 ± 0.19
CVAE_12    | 85.21 ± 0.20  | 4.46 ± 0.04  | 3.73 ± 0.19
CVAE_13    | 83.18 ± 0.18  | 4.77 ± 0.04  | 3.69 ± 0.19
CVAE_123   | 87.92 ± 0.15  | 3.99 ± 0.03  | 2.70 ± 0.14
pCVAE_123  | 87.63 ± 0.16  | 4.04 ± 0.04  | 3.05 ± 0.16
TL_net     | 82.60 ± 0.23  | 4.66 ± 0.04  | 3.85 ± 0.19
AE         | 90.45 ± 0.12  | 3.46 ± 0.03  | 1.50 ± 0.10

Table 1: Reconstruction metrics (mean ± standard error of the mean) for all the studied models.
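As an illustration of how these three metrics can be computed from a ground-truth/reconstruction pair, here is a brute-force NumPy sketch on toy cubes. The shapes and slice direction are illustrative; note that voxel volume and tissue density cancel in the relative mass difference, so no acquisition parameters are needed:

```python
import numpy as np

def dice(a, b):
    return float(2 * np.logical_and(a, b).sum() / (a.sum() + b.sum()))

def hausdorff_2d(a, b):
    """Brute-force symmetric Hausdorff distance between two 2D binary masks."""
    pa, pb = np.argwhere(a), np.argwhere(b)
    d = np.sqrt(((pa[:, None, :] - pb[None, :, :]) ** 2).sum(-1))
    return max(d.min(axis=1).max(), d.min(axis=0).max())

def slicewise_hausdorff(gt, pred):
    """Mean 2D Hausdorff over slices where both masks are non-empty."""
    ds = [hausdorff_2d(g, p) for g, p in zip(gt, pred) if g.any() and p.any()]
    return float(np.mean(ds))

def mass_diff_percent(gt, pred):
    """Voxel volume and density cancel in the relative mass difference."""
    return float(100 * abs(int(pred.sum()) - int(gt.sum())) / gt.sum())

# Toy pair: a small cube and a shifted, slightly enlarged copy of it.
gt = np.zeros((16, 16, 16), bool)
gt[4:12, 4:12, 4:12] = True
pred = np.zeros_like(gt)
pred[4:12, 5:13, 4:13] = True

print(round(dice(gt, pred), 3),
      round(slicewise_hausdorff(gt, pred), 2),
      round(mass_diff_percent(gt, pred), 1))  # 0.824 1.41 12.5
```

The brute-force pairwise distance is fine for small masks like these; production implementations typically use distance transforms instead.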
III Results and Discussion
III-A Accuracy of 3D Reconstruction
Table 1 shows the reconstruction accuracy in terms of 3D Dice score, 2D slice-by-slice Hausdorff distance and LV mass difference between the ground-truth and reconstructed 3D high-resolution segmentations for all the studied architectures. LV mass is an important clinical biomarker; we therefore estimated, for each reconstruction, its percentage mass difference with respect to the ground truth. The results indicate that the reconstruction accuracy decreases as views are removed. From the experiments with two views we can also infer that different views have different importance. In particular, the short-axis view seems to have the smallest impact on the reconstruction accuracy. This could be explained by the fact that the long-axis views contain more information about the regional changes in curvature of the LV, which strongly influence the Dice score. The results reported in Table 1 also show that our architecture significantly outperforms the TL-net, and that pre-training the 2D CNN encoder network did not help to achieve better results. Finally, we can observe that the mass difference is systematically overestimated by a small amount that decreases with the number of views provided. We believe this is a consequence of using the Dice score in the loss function. In contrast, models trained using cross-entropy in the loss function yielded a systematic underestimation of the mass, often producing reconstructions with a missing LV apex, as this loss term tends to favour the background over the myocardium.

III-B Visualisation and Uncertainty Estimation
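The sampling procedure used to build the confidence maps discussed in this subsection can be imitated in a few lines of NumPy. The random projection below is a hypothetical stand-in for the trained conditional decoder, and the latent samples are drawn from the N(0, I) prior (an assumption for this sketch); the grid is downscaled from 80³ to 16³:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the trained conditional decoder: a fixed random
# projection from the 125-D latent space to a downscaled 16^3 voxel grid.
D, side = 125, 16
W = rng.standard_normal((D, side ** 3)) * 0.05

def decode(z):
    probs = 1 / (1 + np.exp(-(z @ W)))              # voxel-wise probabilities
    return (probs > 0.5).reshape(side, side, side)  # one binary reconstruction

# Draw N latent samples, decode each one, and average the binary volumes to
# obtain a per-voxel confidence (frequency) map.
N = 200
confidence = np.mean([decode(rng.standard_normal(D)) for _ in range(N)], axis=0)

print(confidence.shape, confidence.min() >= 0.0, confidence.max() <= 1.0)
```

Each voxel of the averaged map is the fraction of sampled reconstructions that labelled it as myocardium, which is exactly the per-voxel probability interpretation described below.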
In the first and third rows of Fig. 2 we report the reconstructed segmentations obtained with one and three views (in red), overlaid onto the ground-truth segmentation (in black), for one subject of the testing dataset (with DSC 0.80 and 0.89, respectively). In the second and fourth rows we report the corresponding confidence maps for the reconstructions with one and three views. These maps were obtained by sampling N times from the latent distribution to reconstruct N segmentations from the same set of views. Unlike deterministic architectures (such as the TL-net), by averaging these samples we can compute the probability of each voxel being labelled as LV myocardium, providing clinicians with a richer and more intuitive interpretation of the reconstruction. It can be seen in Fig. 2 that the confidence map obtained with only one view has greater uncertainty than the one obtained with three views, which shows lower variability. Moreover, the amount of uncertainty in the map for long-axis view 1 is lower than for the other two views, as this was the view provided to the network as the condition. Interestingly, in the reconstruction with one view the areas of higher uncertainty correspond to the areas with less overlap with the ground truth, i.e. the areas where the network is less accurate in predicting the shape.

IV Conclusions
In this paper we present the first deep conditional generative network for the reconstruction of 3D high-resolution LV segmentations from three segmentations of 2D orthogonal views. The reported results show the potential of this class of models to provide better quantitative cardiac models from sparse data. Future work will focus on using real standard long-axis views (instead of the simulated ones used in this work), on reconstructing multiple structures, and on extending the proposed framework to pathological datasets, for which acquiring breath-hold sequences is even more challenging.
Acknowledgments
The research was supported by the British Heart Foundation (NH/17/1/32725, RE/13/4/30184); National Institute for Health Research (NIHR) Biomedical Research Centre based at Imperial College Healthcare NHS Trust and Imperial College London; Academy of Medical Sciences Grant (SGL015/1006), and the Medical Research Council, UK.
References
 [1] K. Alfakih et al., “Assessment of ventricular function and mass by cardiac magnetic resonance imaging,” European radiology, vol. 14, no. 10, pp. 1813–1822, 2004.
 [2] C. Biffi et al., “Three-dimensional cardiovascular imaging-genetics: a mass univariate framework,” Bioinformatics, vol. 34, no. 1, pp. 97–103, 2018.
 [3] W. Bai et al., “A biventricular cardiac atlas built from 1000+ high resolution mr images of healthy subjects and an analysis of shape and motion,” Medical image analysis, vol. 26, no. 1, pp. 133–145, 2015.
 [4] T. Whitmarsh et al., “Reconstructing the 3D shape and bone mineral density distribution of the proximal femur from dual-energy X-ray absorptiometry,” IEEE transactions on medical imaging, vol. 30, no. 12, pp. 2101–2114, 2011.

 [5] R. Girdhar et al., “Learning a predictable and generative vector representation for objects,” in European Conference on Computer Vision. Springer, 2016, pp. 484–499.
 [6] J. J. Cerrolaza et al., “3D fetal skull reconstruction from 2D US via deep conditional generative networks,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 383–391.
 [7] C. Biffi et al., “Learning interpretable anatomical features through deep generative models: Application to cardiac remodeling,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 464–471.
 [8] C. Doersch, “Tutorial on variational autoencoders,” arXiv preprint arXiv:1606.05908, 2016.