1 Introduction
Magnetic Resonance Imaging (MRI) and ultrasound (US) imaging are the most widely used cardiac image acquisition modalities in clinical routine. While MRI can produce high-contrast, high-resolution and high-SNR images in any orientation, the cardiac function is typically evaluated from a series of kinetic images (cine-MRI) acquired in short-axis orientation of the left ventricle [1]. In clinical practice, cardiac parameters are usually estimated from the knowledge of the endocardial and epicardial borders of the left ventricle (defining the left-ventricular cavity (LV) and the myocardium (MYO)) and the endocardial border of the right ventricle (RV) at the end-diastolic (ED) and end-systolic (ES) phases. MRI is the reference exam for the evaluation of the cardiac function and of cardiac viability after myocardial infarction. Unfortunately, the MRI device is bulky, expensive and, even with the latest innovations, cannot be operated by a single person.
Echocardiography, on the other hand, is a highly flexible and low-cost exam to evaluate the cardiac function. Ultrasound devices are small and inexpensive enough that one can carry them around the hospital. As such, US provides physicians with real-time images in an easy way and is often described as the modern stethoscope. Unfortunately, ultrasound images suffer from a poor SNR, noise artifacts, local signal drop, a limited field of view, and a limited acquisition angle. The most widely used acquisition protocol to evaluate the cardiac function involves a 2D+time long-axis orientation resulting in two- and four-chamber view images. As for MRI, the endocardial and epicardial borders are outlined at the ED and ES time instants. The volume and ejection fraction of the LV are then computed with the biplane Simpson's formula [2].
US and MRI are complementary by nature. US devices can quickly evaluate the heart function, find the source of certain symptoms and detect or exclude pathologies. MRI is an imaging modality used to further assess a disease and for longitudinal analysis. Both MRI and US are non-invasive and non-irradiating imaging techniques.
CNNs have had great success at segmenting these modalities [3, 4, 5, 6, 7]. Some neural networks even produce results whose overall Dice index and/or Hausdorff distance lie within the inter- and intra-observer variations [4, 5]. Unfortunately, these methods still generate spurious, anatomically impossible shapes, with holes inside the structures, abnormal concavities, and duplicated regions, to name a few. Therefore, despite their excellent results on average, these methods are still unfit for day-to-day clinical use.
To reduce such errors, some authors integrate shape priors into their models [3, 6, 7], while others simply post-process the generated shapes with morphological operators or connected component analysis to remove small isolated regions. Unfortunately, none of these approaches can guarantee the anatomical plausibility of their results 100% of the time.
In this paper, we present the first deep learning formalism that guarantees the anatomical plausibility of cardiac shapes, w.r.t. well-defined criteria, under any circumstances. Our method can be plugged at the output of any segmentation method to reduce its number of anatomically invalid shapes to zero, while preserving its overall accuracy. As will be shown in the results section, the same framework is effective for a variety of segmentation methods applied to both echocardiographic and MR images.
2 Previous Work
Although there are more non-deep-learning cardiac segmentation methods than deep learning ones (neural networks are relatively new in the field), we shall focus on the latter due to the very nature of our contribution.
2.1 MRI segmentation
CNNs
The U-Net [8] has become the de facto generic encoder-decoder CNN for biomedical image segmentation and is often used in cardiology. Isensee et al. [9], winners of the 2017 MICCAI Automated Cardiac Diagnosis Challenge (ACDC) [4], used an ensemble of 2D and 3D U-Nets, with the addition of an upscaling and aggregation of the last two convolutional blocks of the decoder for the final segmentation. Also, as mentioned by Bernard et al. [4], several other challengers used a modified version of the U-Net. Vigneault et al. proposed a more domain-specific approach, Omega-Net [10], which has, at its heart, a localization and transformation network that transforms the input MRI into a canonical orientation, which is subsequently segmented by a cascade of U-Nets.
CNNs with shape prior
Although most deep segmentation methods produce accurate segmentation results, they still suffer from anatomical inconsistencies. As a solution, several authors incorporate a shape prior into their model. Oktay et al. use an approach named anatomically constrained neural network (ACNN) [6]. Their neural network is similar to a 3D U-Net whose segmentation output is constrained to be close to a non-linear compact representation of the underlying anatomy, derived from an autoencoder network. More recently, Zotti et al. proposed a method based on the GridNet architecture that embeds a cardiac shape prior to segment MR images [7]. Their shape prior encodes the probability of a 3D location point being a member of a certain class and is automatically registered with the last feature maps of their network. Finally, Duan et al. implemented a shape-constrained bi-ventricular segmentation strategy [3]. Their pipeline starts with a multi-task deep learning approach that aims to locate specific landmarks. These landmarks are then used to initialize atlas propagation during a refinement stage of segmentation. Although the use of an atlas improves the quality of the results, their final segmented shapes strongly depend on the accuracy of the located landmarks. From these studies, it appears that only soft constraints are currently imposed in the literature to steer the segmentation outputs toward a reference shape. As will be shown in this paper, shape-prior methods are not immune to producing anatomically incorrect results.

2.2 Echocardiographic segmentation
CNNs
In 2012, Carneiro et al. exploited deep belief networks and the decoupling of rigid and non-rigid classifiers to improve robustness with respect to image conditions and shape variability [11]. Later, Chen et al. used cross-domain transfer learning to enhance feature representation [12]. In parallel, Smistad et al. showed that the U-Net [8] could be trained with the output of a state-of-the-art deformable model to segment the LV in 2D ultrasound images [13]. Additionally, Leclerc et al. showed that a simple U-Net trained on a large annotated dataset can produce accurate results that are much better than the previous state-of-the-art: on average below the inter-observer variability, and close to, but still above, the intra-observer variability, with 18% of outliers [5]. Recently, the same authors proposed to efficiently integrate the U-Net into a multi-task network (the so-called "LU-Net") designed to optimize a localization and a segmentation procedure in parallel [14]. Their results showed that localization introduces contextualization properties which improve the overall accuracy of cardiac segmentation while reducing the number of outliers to 11%.

CNNs with shape prior
The ACNN model proposed by Oktay et al. [6] was also applied to the segmentation of the endocardial border in 3D echocardiography. Results showed that the use of an autoencoder network to impose soft shape constraints made it possible to obtain highly competitive scores with respect to the state-of-the-art while learning from a limited number of cases (30 annotated volumes). Very recently, Dong et al. developed a deep atlas network to significantly improve 3D LV segmentation based on limited annotation data [15]. The key aspects of this architecture are a lightweight network to perform registration and a multi-level information consistency constraint to enhance the overall model's performance. This method currently has the best scores for LV segmentation in 3D echocardiography. Jafari et al. also proposed to alter the echocardiographic images fed to segmentation models using a framework that introduces soft shape priors into CycleGAN [16]. By enhancing the quality of the input images through image translation, the authors manage to improve the worst-case performance of standard segmentation networks.
3 Proposed Framework
A schematic representation of our method is given in Fig. 1. The system is used for both short-axis MR images and long-axis echocardiographic images, two fairly different-looking cardiac shapes. Overall, the system is made of three blocks, namely: 1) a constrained VAE that learns the latent representation of valid cardiac shapes, 2) an anatomically-constrained rejection sampling procedure to augment the number of latent vectors, and 3) a post-processing VAE that warps anatomically invalid shapes toward the closest valid ones. Since the system implements a post-processing for segmentations, the "Segmentation method" block in Fig. 1 is a placeholder for any cardiac segmentation method. The anatomical guarantees come from an operation called "Latent space transformation" in Fig. 1, which substitutes the latent vector of an incorrect shape with a close but valid one. The correctness of a cardiac shape is determined by a set of complementary anatomical criteria. These criteria make it possible to identify anatomically implausible configurations regardless of the input image. As such, the aim of our system is to output cardiac shapes that always respect these anatomical criteria.
3.1 Anatomical Criteria
Because of the orientations used to acquire cine MR and apical ultrasound images, our system uses two sets of anatomical criteria, namely the short-axis and the long-axis criteria (cf. Fig. 2 and 3 for illustrations). When relevant, thresholds were defined based on the datasets' training sets (ACDC for short-axis, CAMUS for long-axis) so that no clinically relevant segmentation was marked as invalid. Both datasets cover healthy and pathological cases, so the thresholds take into account a representative distribution of cardiac configurations, and not only a subset of healthy configurations. Since these criteria are not included in the loss, they do not need to be differentiable. However, because they are evaluated systematically on every sample drawn from the latent space, they do need to be computable algorithmically with traditional image processing, for efficiency reasons.
Short-Axis Criteria
Our system uses 16 anatomical short-axis criteria that each highlight an invalid cardiac configuration. These criteria are the following:

- (3 criteria) hole(s) in the LV, the RV or the MYO;
- (2 criteria) hole(s) between the LV and the MYO, and between the RV and the MYO;
- (3 criteria) the presence of more than one LV, RV or MYO;
- (1 criterion) the RV is disconnected from the MYO;
- (2 criteria) the LV touches the RV or the background;
- (3 criteria) the LV, RV or MYO has one (or more) acute concavity;
- (2 criteria) for both the LV and the MYO, the ratio of their area to that of a circle having the same perimeter (i.e., the circularity metric) exceeds a certain threshold.
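Several of these criteria reduce to standard image-processing operations on binary masks. The sketch below, assuming `scipy` is available, shows one possible way to implement hole detection, structure counting and the circularity metric; the helper names and the pixel-count perimeter are our simplifications, not the paper's exact implementation, and the actual thresholds are not specified here.

```python
import numpy as np
from scipy import ndimage

def has_hole(mask):
    # A structure has a hole if filling its holes changes the mask
    return bool(np.any(ndimage.binary_fill_holes(mask) != mask))

def num_components(mask):
    # Number of connected components (e.g. to detect "more than one LV")
    _, n = ndimage.label(mask)
    return n

def circularity(mask):
    # Ratio of the structure's area to that of a circle with the same
    # perimeter: 4*pi*A / P^2, equal to 1 for a perfect disk.
    # The perimeter is crudely approximated by counting border pixels.
    area = mask.sum()
    eroded = ndimage.binary_erosion(mask)
    perimeter = (mask & ~eroded).sum()
    return 4 * np.pi * area / max(perimeter, 1) ** 2
```

In practice each criterion is a cheap pass over the label map, which is what makes it feasible to evaluate them on every sample drawn from the latent space.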
Long-Axis Criteria
We use 12 anatomical long-axis criteria to highlight invalid configurations. These criteria are:

- (3 criteria) hole(s) in the LV, the MYO or the left atrium (LA);
- (2 criteria) hole(s) between the LV and the MYO, or between the LV and the LA;
- (3 criteria) the presence of more than one LV, MYO or LA;
- (2 criteria) the size of the area by which the LV touches the background, or the MYO touches the LA, exceeds a certain threshold;
- (1 criterion) the ratio between the minimal and maximal thickness of the MYO is below a given threshold;
- (1 criterion) the ratio between the width of the LV and the average thickness of the MYO exceeds a certain threshold. Both width and thickness are computed as the total width of the structure at the midpoint of its bounding box. The goal is to identify situations for which the MYO is too thin with respect to the size of the LV.
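The width-at-midpoint measurement used by the last criterion can be sketched as follows (the helper name is ours, and ties in the bounding-box midpoint are resolved by integer division, which is an assumption):

```python
import numpy as np

def width_at_bbox_midpoint(mask):
    # Total width of the structure along the row crossing the vertical
    # midpoint of its bounding box, as described in the criterion above
    rows = np.flatnonzero(mask.any(axis=1))
    mid = (rows[0] + rows[-1]) // 2
    return int(mask[mid].sum())
```

Applying it to the LV mask and to the MYO mask, then taking the ratio of the two values, gives the quantity compared against the threshold.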
3.2 Constrained Variational Autoencoder (cVAE)
VAEs [17] are unsupervised neural networks trained to learn the latent representation of a set of data. These neural nets are made of an encoder, which projects an input signal to the latent space, and a decoder, which converts a latent vector back into the input space. More specifically, the VAE encoder outputs the parameters (μ and σ) of a Gaussian distribution q_φ(z|x), where z is a latent vector (z ∈ R^32 in our case) and φ are the parameters of the encoder network. The decoder takes in a latent variable ẑ sampled from q_φ(z|x) and outputs x̂, the reconstruction of the input vector x. As such, the decoder learns the conditional distribution p_θ(x|z), with θ as the decoder parameters. In this work, x and x̂ are 2D cardiac shapes. Since our overarching objective is to learn the latent representation of cardiac shapes, we train our VAE with input values x that are ground-truth cardiac shapes outlined by a medical expert, and thus without any anatomical aberrations. As such, after the VAE has been trained, Gaussian centroids μ encoded from ground-truth cardiac shapes will also lead to an anatomically valid reconstructed shape x̂. In fact, any point z sampled on the manifold of valid cardiac vectors can be decoded into an anatomically valid cardiac shape. As such, we call these vectors valid latent vectors.
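Concretely, the sampling step ẑ ~ q_φ(z|x) is usually implemented with the reparameterization trick, so that gradients can flow through the stochastic node during training. A minimal NumPy sketch (the surrounding network is not shown, and the log-variance parameterization is a common convention, not stated in the paper):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); moving the randomness
    # into eps makes z a deterministic function of (mu, logvar), so
    # gradients can flow through the encoder outputs
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps
```

With a very small variance (large negative log-variance), the sampled z collapses onto the Gaussian centroid μ, which is the deterministic behavior exploited later when decoding latent vectors.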
However, as will be shown later, our method needs to linearly interpolate latent vectors. It follows that a latent vector interpolated between two anatomically valid vectors should also be valid (at least most of the time). Furthermore, our method requires that a small translation performed on a valid latent vector leads to a smooth and anatomically coherent transformation of the resulting decoded image. These constraints can be fulfilled with a roughly linear manifold, which we obtain with a constrained VAE (cVAE) [18]. The constraint comes in the form of a single-neuron regression network [19] trained simultaneously with the encoder and the decoder (cf. Fig. 1). The goal of the linear regression network is to reproduce a domain-specific target t associated with the input image x. Since a single-neuron network with no activation can only learn a linear function, the gradient from the regression loss forces the encoder to learn a more linear (and thus less convoluted) manifold of valid shapes in the latent space. The resulting loss function of our cVAE is:
$\mathcal{L}(\theta, \phi) = -\mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] + D_{KL}\big(q_\phi(z|x)\,\|\,\mathcal{N}(0, I)\big) + \|t - \hat{t}\|_2^2 \qquad (1)$

where the first two terms make up the usual ELBO (Evidence Lower BOund) VAE loss function [17], with $\mathcal{N}(0, I)$ as the unit-variance zero-mean Gaussian prior, and the last term is the L2 regression loss of the one-neuron net ($\hat{t}$ being its prediction).
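Under common simplifying assumptions (a Bernoulli pixel likelihood for the reconstruction term and diagonal Gaussian posteriors, for which the KL divergence has a closed form), Eq. (1) can be sketched in NumPy as follows. The function names and the binary-map simplification are ours; the actual model works on multi-class score maps.

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    # Closed-form D_KL( N(mu, diag(exp(logvar))) || N(0, I) )
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def cvae_loss(x, x_hat, mu, logvar, t, t_hat):
    eps = 1e-7  # numerical safety for the logs
    # Bernoulli negative log-likelihood (reconstruction term of the ELBO)
    recon = -np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps))
    kl = kl_to_std_normal(mu, logvar)
    reg = np.sum((t - t_hat) ** 2)  # L2 loss of the one-neuron regressor
    return recon + kl + reg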
MRI short-axis linear constraint
Since cine-MR short-axis images are 2D+time arrays, only the ES and ED phases are considered in our study, stacked into two 3D volumes. The target predicted by the one-neuron regression network is then the slice index of the input image x, normalized between 0 (base) and 1 (apex).
Ultrasound long-axis linear constraint
The ultrasound signal is a 2D+time sequence of images. In this case, the regression network is designed to predict the time instant of the input image x. Here as well, the target value is normalized between 0 and 1, where 0 stands for the end-diastolic time instant and 1 for the end-systolic time instant.
3.3 Anatomically-Constrained Data Augmentation
As mentioned before, once the cVAE is trained, the 2D ground-truth cardiac shapes can be projected into the 32-D latent space, where they form a manifold of valid latent vectors. These latent vectors are "anatomically correct", since the deterministic cVAE decoder can convert them back to anatomically valid cardiac shapes.
The idea behind our method is to warp invalid cardiac shapes toward a close but valid configuration. This is done by projecting an invalid cardiac shape to the latent space, projecting its associated invalid latent vector onto the closest point of the manifold of valid latent vectors, and then decoding the resulting vector. Unfortunately, with 32 dimensions, the latent space has a whopping 2^32 quadrants, which is orders of magnitude more than the size of any annotated cardiac dataset. As such, with too few valid latent vectors, the manifold is too sparse to be effective.
One solution to this problem is to increase the number of valid latent vectors through data augmentation. Since the manifold in the latent space is roughly linear, one can easily sample it with a rejection sampling (RS) method [20]. The goal is to generate a new set of latent vectors such that the distribution of these newly generated samples is close to f(z), the distribution from which the original valid latent vectors were drawn, independent and identically distributed (iid). Since sampling f(z) directly is difficult, RS instead samples a second, easier probability density function g(z). A common choice for g(z) is a Gaussian whose mean and variance equal those of the original valid latent vectors derived from the ground-truth segmentations. A key requirement of RS is that f(z) ≤ M g(z), where M ≥ 1 is a constant. Given f(z), g(z) and M, the sampling procedure first generates a random sample z_0, iid from g(z), as well as a uniform random value u ∈ [0, 1]. If u < f(z_0)/(M g(z_0)), then z_0 is kept, otherwise it is rejected. Since in our case f(z) is unknown a priori, we estimate it with a Parzen-window distribution [19]. The primary objective of RS is to increase the number of latent vectors. However, since these newly generated points need to lie on the manifold of valid vectors, we want those new vectors to correspond to anatomically valid cardiac shapes. As such, we redefine the RS criterion as follows:
$u < \frac{f(z_0)}{M\,g(z_0)}\; A\big(D(z_0)\big) \qquad (2)$

where D(·) is the VAE decoder that converts the latent vector z_0 into a segmentation map and A(·) is an indicator function which returns 1 when the input segmentation map respects the defined anatomical criteria and zero otherwise. In Fig. 1, this operation is called anatomically-constrained rejection sampling augmentation. This sampling procedure is repeated until the desired number of samples is reached. In the end, a total of 4 million latent vectors were generated, both for the MRI and the ultrasound datasets. Each of these vectors has a corresponding valid cardiac shape that respects the aforementioned anatomical criteria (cf. Section 3.1). Samples of cardiac shapes generated with anatomically-constrained rejection sampling augmentation are provided in Fig. 3.
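The anatomically-constrained sampler can be sketched as follows, here in a toy 1D latent space with a stand-in validity predicate in place of A(D(·)). All names, densities and constants are illustrative (in particular the Parzen bandwidth and M), not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
valid_z = rng.normal(0.0, 1.0, size=200)  # stand-in for encoded ground-truth shapes

def f_hat(z, h=0.3):
    # Parzen-window estimate of the density f(z) of valid latent vectors
    return np.mean(np.exp(-0.5 * ((z - valid_z) / h) ** 2)) / (h * np.sqrt(2 * np.pi))

mu, sigma = valid_z.mean(), valid_z.std()

def g(z):
    # Proposal density: Gaussian matching the moments of the valid vectors
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def is_anatomically_valid(z):
    # Placeholder for A(D(z)): decode z and run the anatomical criteria
    return abs(z) < 3.0

M = 1.5  # chosen so that f_hat(z) <= M * g(z) over the support

def sample(n):
    out = []
    while len(out) < n:
        z0 = rng.normal(mu, sigma)           # draw z0 ~ g
        u = rng.uniform()
        # anatomically-constrained acceptance test (Eq. 2)
        if u < f_hat(z0) / (M * g(z0)) and is_anatomically_valid(z0):
            out.append(z0)
    return np.array(out)
```

In the real system, `is_anatomically_valid` decodes the 32-D vector with the cVAE decoder and runs the criteria of Section 3.1, so every accepted vector maps to a plausible cardiac shape.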
3.4 Cardiac shape warping
Our system can be seen as a post-processing operator that one can plug after any segmentation method that sometimes generates anatomically erroneous segmentation maps. This is illustrated at the bottom right of Fig. 1, where a VAE is used to convert erroneous segmentation maps into anatomically valid segmentations. This post-processing VAE is in fact the trained cVAE. Thus, any anatomically invalid segmentation map fed to the VAE encoder gets projected into the latent space, where 4 million valid vectors lie. Furthermore, since the VAE decoder is deterministic, any anatomically valid latent vector is guaranteed to be converted into an anatomically plausible cardiac shape.
As mentioned before, our aim is to warp an anatomically incorrect cardiac shape toward a close but correct configuration. We do so by translating the latent vector z of an erroneous cardiac shape to a nearby but anatomically valid latent vector ẑ. This operation can be summarized as:
$\hat{z} = \underset{z'}{\arg\min}\; \|z - z'\|_2 \quad \text{s.t.} \quad A\big(D(z')\big) = 1 \qquad (3)$
The result of this optimization is a valid latent vector ẑ that is the closest to z. However, since A(·) involves non-differentiable anatomical criteria, the optimization formulation of Eq. (3) cannot be solved with a usual Lagrangian solution. An alternative is to redefine the problem of finding ẑ as that of finding the smallest vector δz such that A(D(z + δz)) = 1, with δz = α(z_j − z) and α ∈ [0, 1]. In our case, we recover δz based on the nearest neighbor of z in the augmented latent space, i.e. z_j corresponds to the nearest valid latent vector. This leads to an easier 1D optimization problem:
$\alpha^* = \min_{\alpha \in [0,1]} \alpha \quad \text{s.t.} \quad A\big(D(z + \alpha\,(z_j - z))\big) = 1 \qquad (4)$
that we solve with a dichotomic search. Starting with α = 0.5, at each iteration the anatomical criterion dictates which half of the search space should be explored further: lower values of α if A(D(z + α(z_j − z))) = 1, and higher values of α otherwise. Since the dichotomic search reduces the search space exponentially fast, the optimization algorithm is stopped after five iterations. In the end, the selected α is the smallest explored value that validates the anatomical criterion.
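The five-iteration dichotomic search over α can be sketched as follows; the function names are ours, and `decode` and `is_valid` stand for D(·) and the anatomical criteria A(·):

```python
def smallest_valid_alpha(z, z_valid, decode, is_valid, iters=5):
    """Binary search for the smallest alpha in [0, 1] such that
    decode(z + alpha * (z_valid - z)) passes the anatomical criteria.
    alpha = 1 lands exactly on the valid nearest neighbor, so a valid
    answer always exists."""
    lo, hi = 0.0, 1.0
    alpha = 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if is_valid(decode(z + mid * (z_valid - z))):
            alpha, hi = mid, mid   # valid: try a smaller translation
        else:
            lo = mid               # invalid: move closer to the valid vector
    return alpha
```

After five iterations the interval has width 2^-5 ≈ 0.03, which is why the search can be truncated without a meaningful loss of precision.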
3.5 Robust VAE
Two limitations of the method proposed so far are the millions of latent vectors that must be stored in memory and the nearest-neighbor search that must be performed each time a segmentation result is anatomically flawed. In this section, we present an alternative method that requires neither the storage of latent vectors nor the search for nearest neighbors. This method allows for faster processing and reduced memory usage, but without the previous method's anatomical guarantees. The choice between the two methods depends on the application at hand.
Instead of using the post-processing VAE with cardiac shape warping as in Fig. 1, we implemented a robust VAE (rVAE). The goal of this new VAE is to directly convert erroneous segmentation maps into anatomically plausible configurations. To do so, we added a step to the VAE training procedure. Starting with a pretrained cVAE, we fixed the weights of the decoder and of the single-neuron regression network, and fine-tuned the cVAE like a denoising autoencoder, i.e. by feeding it anatomically implausible maps and training it to reproduce valid segmentations (cf. bottom of Fig. 4). Since the decoder is fixed, this forces the new encoder of the cVAE to learn to project erroneous segmentation maps close to their corresponding valid latent vectors. In practice, we generated a synthetic training set of 10,000 pairs of anatomically valid and invalid cardiac shapes, using the generative capabilities of the cVAE. As shown at the top of Fig. 4, we added some noise to the latent vectors obtained from the training data, and decoded the resulting vectors. More precisely, the valid latent vector z of an input image x is shifted along the axis defined by the single-neuron regression network parameters to obtain a noisy latent vector z̃. This warped latent vector is decoded to produce a segmentation map x̃. Because of the linear constraint, the latent distribution is stretched along a plane perpendicular to the axis defined by the single-neuron regression network, to allow for a linear separation of the domain-specific target. At equal magnitude, warping the latent vector along the normal of that plane is more likely to produce out-of-distribution samples than warping along any other direction in the latent space. Since out-of-distribution samples are more likely to be decoded into implausible segmentation maps, this perturbation of the latent vector is a suitable way to obtain an artificial, anatomically invalid cardiac shape paired with the original valid cardiac shape.
The rVAE is trained to recreate x from x̃. An additional constraint is used to encourage the encoder to project erroneous segmentations close to their corresponding valid latent vectors. This constraint is implemented as an additional KL loss term, which minimizes the distance between the latent distribution obtained by the rVAE on the noisy data and the original latent distribution generated by the cVAE on the clean training data.
For a given pair $(x, \tilde{x})$, the loss function is:

$\mathcal{L} = -\mathbb{E}_{q_{\phi'}(\tilde{z}|\tilde{x})}\big[\log p_\theta(x|\tilde{z})\big] + D_{KL}\big(q_{\phi'}(\tilde{z}|\tilde{x})\,\|\,q_\phi(z|x)\big) \qquad (5)$

where $\phi'$ are the parameters of the fine-tuned rVAE encoder, and $\phi$ and $\theta$ are the frozen cVAE encoder and decoder parameters.
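The added KL term compares two diagonal Gaussians (the rVAE posterior on x̃ and the frozen cVAE posterior on x), for which a closed form exists. A NumPy sketch, with our own function name and the common log-variance parameterization:

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    # Closed-form D_KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) )
    # for diagonal covariances, summed over latent dimensions
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )
```

The term vanishes when the two posteriors coincide and grows with the distance between their means, which is exactly the pressure that pulls the noisy encoding toward the clean one.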
Dataset | AE | VAE | VAE (Registered) | VAE (Const.) | VAE (Reg. + const.)
ACDC | 64.76 | 5.84 | 5.85 | 8.48 | 1.25
CAMUS | 41.25 | 1.48 | 0.52 | 3.32 | 0.12
3.6 Implementation Details
The encoder of our cVAE is made up of 4 convolutional blocks, followed by two fully-connected heads that output the μ and σ parameters of the posterior distribution. Each convolutional block consists of two convolutional layers with ELU [21] activations: the first one with stride 2 (to downsample by half in lieu of pooling), and the second with stride 1 and same padding. The dimensionality of the latent space was fixed at 32, to remain as low as possible while allowing for high reconstruction accuracy.
The decoder follows a similar structure, first using a fully-connected layer to project the latent vector to the same volume as the output of the last convolutional block in the encoder. After the FC layer comes a 4-block structure mirroring the encoder. Each block consists of 2 layers with ELU [21] activations: the first one is a transposed convolution with stride 2 (to upsample by 2), and the second one is a convolution with stride 1 and same padding. A final convolution layer with stride 1 and same padding outputs the pixel-wise score for each class.
The number of feature maps is set to 48 for the first layer, and doubles at each successive block in the encoder. It follows the reverse logic in the decoder, where it is halved in each block so as to reach 48 just before the final convolution with softmax. The encoder and decoder are trained end-to-end with the Adam optimizer [22], using dataset-specific learning rates for ACDC and CAMUS. In both cases, weight regularization was applied.
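To make the feature-map bookkeeping concrete, the sketch below traces the channel count and spatial size through the encoder's four stride-2 blocks. The 256x256 input size and the 4-class channel count used in the example are our assumptions for illustration; the paper does not restate them here.

```python
def encoder_shapes(h, w, in_ch, base=48, blocks=4):
    # (channels, height, width) after each stride-2 convolutional block:
    # channels double at each block (48, 96, 192, 384) while the
    # spatial resolution is halved
    shapes, ch = [(in_ch, h, w)], in_ch
    for b in range(blocks):
        ch = base * (2 ** b)
        h, w = h // 2, w // 2
        shapes.append((ch, h, w))
    return shapes
```

The decoder mirrors this list in reverse, halving the channel count and doubling the resolution at each transposed-convolution block.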
The AE mentioned in the ablation study of Table 1 uses the exact same architecture and hyperparameter values, except for one adaptation: at the end of the encoder, a single fully-connected head is used to directly obtain the latent vector, instead of the parameters of the posterior distribution.
The segmentation maps were resized and registered. In the case of ACDC, the registration process implied centering the image on the LV, and aligning the centers of the LV and RV cavities on a horizontal line. With CAMUS, registering meant centering the image on the union of the LV and MYO, and vertically aligning the principal axis of the LV. During inference, the registration is based on the results of the segmentation method rather than on the ground truth. Because this is done prior to any of our post-processing, our method depends on the original segmentation being at least somewhat accurate w.r.t. the position and orientation of the heart.
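A simplified version of the centering step (e.g. ACDC's centering on the LV) can be sketched with a center-of-mass shift, assuming `scipy` is available; the rotation alignment is omitted and the helper name is ours:

```python
import numpy as np
from scipy import ndimage

def center_on_structure(seg, label):
    # Shift the segmentation so that the chosen structure's center of
    # mass lands at the image center (translation part of registration;
    # the rotational alignment described above is not shown)
    cy, cx = ndimage.center_of_mass(seg == label)
    h, w = seg.shape
    shift = (round(h // 2 - cy), round(w // 2 - cx))
    return np.roll(seg, shift, axis=(0, 1))
```

At inference time the same operation would be driven by the predicted labels, which is why a grossly misplaced initial segmentation degrades the whole pipeline.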
Methods | Original | VAE | Robust (rVAE) | NN w/o RS | NN w/ RS | NN + Dicho
Zotti2 [7] | 55 | 16 | 7 | 0 | 0 | 0
Khened [23] | 55 | 16 | 9 | 0 | 0 | 0
Baumgartner [24] | 79 | 17 | 8 | 0 | 0 | 0
Zotti [25] | 82 | 15 | 7 | 0 | 0 | 0
Grinias [26] | 89 | 12 | 6 | 0 | 0 | 0
Isensee [9] | 128 | 21 | 7 | 0 | 0 | 0
Rohé [27] | 287 | 40 | 21 | 0 | 0 | 0
Wolterink [28] | 324 | 42 | 16 | 0 | 0 | 0
Jain [29] | 185 | 28 | 17 | 0 | 0 | 0
Yang [30] | 572 | 182 | 137 | 0 | 0 | 0
ACNN [6] | 139 | 41 | 21 | 0 | 0 | 0
4 Experimental Setup and Results
4.1 Datasets, evaluation criteria, and other methods
4.1.1 MRI dataset
The MRI dataset is the 2017 ACDC dataset [4], which contains short-axis cine-MR images of 150 patients: 100 for training and 50 for testing. In particular, a series of short-axis slices covers the LV from the base to the apex, with one image every 5 or 10 mm, depending on the examination. The spatial resolution goes from 1.37 to 1.68 mm²/pixel, and 28 to 40 images cover the cardiac cycle. The end-diastolic and end-systolic phases were visually selected. As shown in Fig. 6(a), the LV, RV and MYO of every patient have been manually segmented. We report the average Hausdorff distance (HD) and 3D Dice index for the LV, RV and MYO, as well as the LV and RV ejection fraction (EF) absolute errors. Since our approach can post-process any segmentation method, we tested it on the test results reported by ten ACDC challengers. Their methods are summarized by Bernard et al. [4], except for Zotti2 [7], whose results were uploaded after the challenge. We also report results for the ACNN method of Oktay et al. [6], which uses a latent anatomical prior together with a segmentation CNN. Results from our best ACNN implementation (which involves a U-Net and our VAE) are very close to those of the original paper, despite the fact that the ACDC training set is smaller than the one used in the original paper [6]. HD values are also slightly larger, since we use a 3D HD instead of the 2D HD used in the original paper.
4.1.2 Echocardiographic dataset
The CAMUS dataset [5] consists of conventional clinical exams from 500 patients acquired with a GE Vivid E95 ultrasound scanner. The acquisitions were optimized to perform measurements of the left ventricular ejection fraction. For each patient, 2D apical four-chamber and two-chamber view sequences were acquired with the same acquisition protocol and exported from the EchoPAC analysis software (GE Vingmed Ultrasound, Horten, Norway). The corresponding videos are expressed in native polar coordinates. The same resampling scheme was applied to each sequence to express the corresponding images in a Cartesian coordinate system with a constant grid resolution of λ/2 (i.e. 0.31 mm) in the lateral direction and λ/4 (i.e. 0.15 mm) in the axial direction, where λ corresponds to the wavelength of the ultrasound probe. The dataset is divided into 10 folds of equal size, nine of which are used for training and one for testing. The image quality (poor, medium, and good) and the ejection fraction (≤45%, ≥55%, or in between) are uniformly distributed across the folds. A senior cardiologist manually annotated the endocardial and epicardial borders of the left ventricle, as well as the atrium, on the end-diastolic (ED) and end-systolic (ES) images of every patient.
We tested our framework on the output of 7 methods: four convolutional networks (U-Net [8, 5], LU-Net, ENet [31] and SHG [32]) and three non-deep-learning methods (SRF [33], BEASM-auto [34, 35], and BEASM-semi [34, 5]). Note that the non-deep-learning methods were state-of-the-art up until 2017.
Dice index / Hausdorff distance (mm):

Methods | Original | VAE | Robust (rVAE) | NN w/o RS | NN w/ RS | NN + Dicho
Zotti2 [7] | .913 / 9.7 | .910 / 10.1 | .910 / 11.3 | .899 / 14.4 | .909 / 11.0 | .910 / 10.1
Khened [23] | .915 / 11.3 | .912 / 12.3 | .912 / 11.8 | .894 / 15.2 | .909 / 12.7 | .912 / 10.9
Baumgartner [24] | .914 / 10.5 | .911 / 11.2 | .912 / 10.8 | .889 / 18.2 | .907 / 12.6 | .910 / 10.6
Zotti [25] | .910 / 9.7 | .907 / 10.9 | .907 / 11.3 | .878 / 19.6 | .903 / 12.6 | .907 / 11.0
Grinias [26] | .835 / 15.9 | .833 / 19.3 | .834 / 15.7 | .752 / 32.5 | .825 / 16.9 | .833 / 15.8
Isensee [9] | .926 / 9.1 | .923 / 10.7 | .923 / 9.7 | .881 / 18.4 | .917 / 11.2 | .923 / 9.2
Rohé [27] | .891 / 12.2 | .887 / 14.6 | .886 / 16.3 | .756 / 32.2 | .874 / 15.1 | .887 / 12.8
Wolterink [28] | .907 / 10.8 | .903 / 13.0 | .902 / 11.6 | .752 / 32.8 | .887 / 13.5 | .903 / 11.0
Jain [29] | .891 / 12.2 | .886 / 12.6 | .885 / 13.0 | .820 / 31.9 | .878 / 14.2 | .886 / 11.6
Yang [30] | .800 / 27.5 | .752 / 21.7 | .742 / 24.3 | .455 / 29.7 | .722 / 11.5 | .752 / 10.2
ACNN [6] | .892 / 12.3 | .886 / 26.2 | .885 / 21.6 | .885 / 12.0 | .885 / 12.2 | .889 / 13.1

LV / RV ejection fraction absolute error (%):

Methods | Original | VAE | Robust (rVAE) | NN w/o RS | NN w/ RS | NN + Dicho
Zotti2 [7] | 2.54 / 5.11 | 2.63 / 5.12 | 2.59 / 5.08 | 2.49 / 5.57 | 2.58 / 5.18 | 2.62 / 5.18
Khened [23] | 2.39 / 5.24 | 2.41 / 4.96 | 2.43 / 5.18 | 2.70 / 5.36 | 2.63 / 5.07 | 2.42 / 5.27
Baumgartner [24] | 2.58 / 6.00 | 2.62 / 6.30 | 2.54 / 6.18 | 2.83 / 6.72 | 2.85 / 6.48 | 2.64 / 6.33
Zotti [25] | 2.98 / 5.48 | 2.98 / 5.42 | 3.04 / 5.57 | 3.06 / 5.72 | 3.10 / 5.71 | 3.06 / 5.59
Grinias [26] | 4.14 / 7.39 | 4.18 / 7.86 | 3.94 / 7.59 | 4.67 / 8.00 | 4.33 / 7.35 | 4.01 / 7.43
Isensee [9] | 2.16 / 4.85 | 2.15 / 4.61 | 2.18 / 4.85 | 2.49 / 5.58 | 2.35 / 4.48 | 2.20 / 4.82
Rohé [27] | 2.84 / 8.18 | 2.95 / 7.85 | 2.85 / 8.34 | 3.13 / 8.93 | 3.39 / 7.97 | 2.91 / 8.11
Wolterink [28] | 2.75 / 6.59 | 2.82 / 6.39 | 2.83 / 6.42 | 3.40 / 6.93 | 3.48 / 6.07 | 2.84 / 6.44
Jain [29] | 4.36 / 8.49 | 4.35 / 8.83 | 4.46 / 9.09 | 4.98 / 9.63 | 4.59 / 8.69 | 4.40 / 8.72
Yang [30] | 6.22 / 15.99 | 6.80 / 20.56 | 5.40 / 21.58 | 7.57 / 27.9 | 7.77 / 22.09 | 9.10 / 21.76
ACNN [6] | 2.46 / 3.68 | 2.53 / 4.09 | 2.59 / 4.05 | 2.51 / 3.89 | 2.96 / 3.82 | 2.50 / 3.71
4.2 Experimental Results
4.2.1 Constrained variational autoencoder
We gauged the linearity property of the latent space generated by our cVAE through the ablation study in Table 1. Since our postprocessing method relies on latent vector interpolation (c.f. Eq (4)), we computed the percentage of anatomically incorrect results obtained after interpolating a series of two valid latent vectors chosen at random. To do so, we iteratively selected two random groundtruth images from two random patients, projected it to the latent space with the cVAE encoder and linearly interpolated 25 new latent vectors. We then converted these 25 vectors back into the image space with the cVAE decoder and computed their percentage of anatomical errors. This procedure is illustrated in Fig. 5.
We repeated that process 300 times, i.e. every pairwise combination of 25 random vectors (both for CAMUS and ACDC), for the cVAE with and without registration and with and without the one-neuron regression net. We first tested our full method (i.e. with image registration and an L2 regression constraint), then removed the image registration but kept the regression constraint, then removed the regression and kept the image registration, and finally used the VAE alone, without registration or regression. As shown in Table 1, the combination of image registration and regression constraint reduces the percentage of anatomically implausible results down to  for ACDC and a negligible  for CAMUS, which is more than 4x lower than for any other configuration. As for a simple autoencoder (cf. the AE column), since it puts no constraint whatsoever on the latent space, the percentage of errors is orders of magnitude larger.
4.2.2 ACDC Post-processing results
Results on the ACDC test set are given in Table 2 and Table 3. Table 2 contains the total number of slices with at least one anatomical error for 11 different methods, the first ten being official ACDC challengers. Results without our post-processing are in the “Original” column. In Table 3, we report for the same methods their associated Dice index and Hausdorff distance (HD) [top] as well as their LV and RV ejection fraction absolute errors [bottom].
As can be seen from the “VAE” column in Table 2, feeding every erroneous segmentation map to our VAE without transforming the latent vector significantly reduces the number of anatomical errors without much affecting the average clinical metrics (Table 3). This comes as no surprise, since the VAE was trained to output similar anatomically correct cardiac shapes (the ACDC test set has a total of  slices). The rVAE further reduces the number of anatomical errors by a factor of almost 2, without significantly impacting the overall anatomical metrics. With a processing time 10 times faster than our most accurate method, the rVAE can be seen as a good compromise for real-time applications.
However, like any neural network, a VAE (be it robust or not) comes with no guarantee on the validity of its output. To completely eliminate erroneous segmentations, we tested three variants of our method. At first, we swapped erroneous latent vectors with their nearest neighbor (thus forcing  in Eq. (4)), without and with rejection sampling (cf. columns “w/o RS” and “w/ RS”). As mentioned in Section 3.3, rejection sampling increased the number of anatomically correct latent vectors to 4 million. Although both methods reduce the number of anatomical errors to zero, we can see from Table 3 that data augmentation systematically produces better results. Also, while the improvements are incremental for top-performing methods (e.g. the Dice index of Zotti2 went from .899 to .909 and its HD from 14.4 to 11.0), they are drastic for methods with a large number of anatomical errors (e.g. Wolterink saw its Dice index go from .752 to .887 and its HD from 32.2 to 13.5). We can thus conclude that our method, without a data-augmented latent space, could hurt the overall accuracy of certain methods.
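The nearest-neighbor swap and the rejection-sampling augmentation can be sketched as below. This is a schematic NumPy version under stated assumptions: `decode`, `is_valid`, and the latent bank are hypothetical placeholders for the cVAE decoder, the anatomical criteria, and the set of valid latent vectors.

```python
import numpy as np

def nearest_valid_latent(z_err, valid_bank):
    """Swap an erroneous latent vector for its nearest neighbor
    (Euclidean distance) in a bank of anatomically valid codes."""
    dists = np.linalg.norm(valid_bank - z_err, axis=1)
    return valid_bank[np.argmin(dists)]

def augment_bank(decode, is_valid, latent_dim, n_target, rng):
    """Rejection sampling: grow the bank by drawing latent vectors
    from the VAE prior N(0, I) and keeping only those whose
    decoding passes the anatomical checks."""
    bank = []
    while len(bank) < n_target:
        z = rng.standard_normal(latent_dim)
        if is_valid(decode(z)):
            bank.append(z)
    return np.stack(bank)
```

A denser bank of valid codes makes the nearest neighbor closer to the erroneous vector, which is why augmenting the bank (here up to `n_target` samples) improves the “w/ RS” variant over “w/o RS”.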
Methods  Original  VAE  VAE (Robust)  NN (w/o RS)  NN (w/ RS)  NN (Dicho)
U-Net [8, 5]  84  16  14  0  0  0
LU-Net [14]  25  11  6  0  0  0
ENet [31]  69  21  22  0  0  0
SHG [32]  38  5  5  0  0  0
SRF [33]  101  46  48  1  2  2
BEASM-auto [34, 35]  12  2  3  0  0  0
BEASM-semi [34, 5]  10  4  7  0  0  0
The last column of Tables 2 and 3 shows the results of our complete method, i.e. Eq. (4) optimized with a dichotomic search on a data-augmented latent space. While all results respect the anatomical criteria, the EF error and the Dice index are almost identical to those of the original methods. The HD also never increases by more than 1.3 mm. Considering that the average voxel size is near 1.4 × 1.4 × 10 mm³, this increase corresponds to less than 1 pixel in the image. This shows that our approach does not degrade the overall results of a given method.
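The dichotomic search can be sketched as a binary search on the interpolation weight between the erroneous latent vector and a valid anchor, keeping the valid decoding closest to the original. Again, `decode` and `is_valid` are hypothetical stand-ins for the cVAE decoder and the anatomical criteria.

```python
import numpy as np

def dichotomic_correction(z_err, z_valid, decode, is_valid, n_iter=20):
    """Binary search for the smallest interpolation weight alpha such
    that (1 - alpha) * z_err + alpha * z_valid still decodes to an
    anatomically valid shape (alpha = 1 is valid by construction)."""
    lo, hi = 0.0, 1.0  # invariant: alpha = hi decodes to a valid shape
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        z = (1.0 - mid) * z_err + mid * z_valid
        if is_valid(decode(z)):
            hi = mid  # valid: move closer to the erroneous vector
        else:
            lo = mid  # invalid: move back toward the valid anchor
    return (1.0 - hi) * z_err + hi * z_valid
```

Each iteration halves the search interval, so a handful of decoder calls suffices to land within a small distance of the validity boundary while guaranteeing the returned code decodes to a valid shape.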
Fig. 6 (a) shows erroneous predictions before and after our post-processing. While the correct areas are barely affected by our method, erroneous sections, large or small, get smoothly warped. Our method takes roughly 1 second to process a 2D image on a midrange computer equipped with a Titan X GPU.
Methods  Original  VAE  VAE (Robust)  NN (w/o RS)  NN (w/ RS)  NN (Dicho)
Dice index / Hausdorff distance (mm):
U-Net [8, 5]  .921 / 6.0  .923 / 5.7  .923 / 5.7  .922 / 5.7  .922 / 5.7  .923 / 5.7
LU-Net [14]  .922 / 5.9  .921 / 5.9  .922 / 5.9  .921 / 5.9  .921 / 6.0  .921 / 6.0
ENet [31]  .923 / 5.8  .921 / 5.9  .921 / 5.9  .920 / 5.9  .920 / 5.9  .921 / 5.9
SHG [32]  .915 / 6.2  .915 / 6.2  .916 / 6.2  .915 / 6.2  .915 / 6.2  .915 / 6.2
SRF [33]  .879 / 13.1  .877 / 13.2  .878 / 13.2  .879 / 13.0  .879 / 13.0  .879 / 13.0
BEASM-auto [34, 35]  .868 / 10.5  .868 / 10.5  .867 / 10.5  .868 / 10.5  .868 / 10.5  .868 / 10.5
BEASM-semi [34, 5]  .899 / 7.8  .899 / 7.8  .899 / 7.8  .899 / 7.8  .899 / 7.8  .899 / 7.8
LV ejection fraction absolute error:
U-Net [8, 5]  5.4  5.6  5.6  5.9  5.9  5.7
LU-Net [14]  5.1  5.1  5.1  5.4  5.2  5.2
ENet [31]  5.6  5.4  5.4  5.5  5.6  5.4
SHG [32]  5.8  5.9  5.9  6.1  6.1  6.0
SRF [33]  12.7  14.5  14.3  14.4  14.4  14.3
BEASM-auto [34, 35]  10.5  10.5  10.5  10.6  10.5  10.5
BEASM-semi [34, 5]  9.8  9.8  9.8  9.8  9.8  9.8
4.2.3 CAMUS Post-processing results
We performed a similar set of experiments on the CAMUS dataset. Results are reported in Table 4 (number of anatomically invalid slices) and Table 5 (clinical metrics). As for ACDC, the use of a simple VAE significantly reduces the number of anatomical errors without much affecting the average Dice index, HD, and EF absolute error. However, unlike for ACDC, the robust VAE did not further reduce the number of errors, especially for the non-deep-learning methods. This may be explained by the fact that the number of anatomical errors is already low with a basic VAE.
Another difference with ACDC is the results of our three nearest-neighbor methods. While they reduce the number of anatomical errors to zero, all three have almost the same Dice index, HD, and anatomical errors. This can be explained by the fact that long-axis cardiac shapes are roughly similar from one patient to another, regardless of the time instant (cf. Fig. 5). This is unlike the short-axis view, where the shape varies greatly from the base of the heart down to the apex. As such, the valid long-axis latent vectors are probably closer together, so a simple nearest-neighbor swap is enough to enforce our anatomical criteria on the output while preserving the overall anatomical shape.
However, as can be seen from Table 5, as for ACDC, our method does not significantly degrade the anatomical or the clinical metrics.
4.2.4 Inter-observer variability
The inter-observer variability of cardiac MRI and echocardiographic image segmentation was reported by Bernard et al. [4] and Leclerc et al. [5]. For MRI segmentation, the average inter-observer Dice score for the LV, the RV and the MYO at the end-systolic and end-diastolic time instants is , while the average Hausdorff distance is  mm [4]. As can be seen from Table 3, the methods with a Dice score above  (column ) remain above it after our processing (column ). Also, the only method with a Hausdorff distance below  mm is that of Isensee [9], which remains below it after our processing.
As for echocardiographic segmentation, the average inter-observer Dice score reported in [5] is  and the average Hausdorff distance is  mm. Again, as can be seen from the  column of Table 5, the first four methods are within the inter-observer variability, and still are after our post-processing (column ).
This reveals that while our method is guaranteed to produce results that follow predefined anatomical guidelines, it does not degrade the overall accuracy of highly effective methods.
4.2.5 Post-processing degenerate results
Our method has its own limits and cannot be regarded as a cure-all. While it guarantees the anatomical validity of the output w.r.t. the hard-coded criteria, it by no means guarantees that the produced output is close to the ground truth. As such, if the erroneous segmentation map to correct has little to no overlap with the ground truth, our method will not necessarily warp it toward the ground truth; it will only warp it to the closest correct cardiac shape. Three such examples are provided in Fig. 7, where the result of our method is not closer to the ground truth than . In fact, the cardiac shape of Fig. 7 (c) is so degenerate that the produced output is perpendicular to the ground truth (because the inverse registration operation is based on the principal axis of the LV, which in this case is horizontal). Also, despite the fact that the produced shape is anatomically valid, the segmentation is sideways, causing the computation of the LV’s width and the MYO’s thickness (cf. criterion 6 in Sec. 3.1) to be inaccurate and to detect an anomaly, hence the 1 and 2 errors reported for the SRF method in Table 4. This particular example also illustrates what can happen when the original segmentation is so bad that even the inference-time registration is inaccurate, as mentioned at the end of Sec. 3.6. That said, even for inaccurate segmentation methods (e.g. Grinias and Yang in Table 3), our method does not worsen their overall scores. Metrics computed solely on the anatomically incorrect images are provided in the supplementary materials and also show that our method does not degrade the overall metrics.
5 Conclusion
We proposed a post-processing cVAE that converts invalid cardiac shapes into close but anatomically correct shapes. This is done by replacing the latent vector of an invalid shape with a close but valid latent vector. Extensive tests performed on the output of 18 segmentation methods reveal that our method is effective both on short-axis views from MRI and on long-axis views from US. Our method relies on a series of anatomical criteria (16 for SA and 12 for LA) that we use both to detect abnormalities and to populate the cVAE latent space. One appealing feature of the proposed framework is that the anatomical criteria do not need to be differentiable, as they are not included in the loss. Furthermore, we have shown that warping the incorrect segmentation shapes did not significantly change the overall geometrical metrics (Dice index and Hausdorff distance) or the clinical metrics (the RV and LV ejection fractions). As such, according to the inter- and intra-expert variations reported by Bernard et al. [4] and Leclerc et al. [5], methods such as Isensee, Zotti2, Khened, and Baumgartner for ACDC and LU-Net for CAMUS are within the inter-expert variation and, with our method, are now guaranteed to produce results that follow anatomical guidelines defined by the user. From the point of view of a clinical expert, it is preferable to have a plausible segmentation close to the expected one than an efficient system that spuriously provides aberrant segmentations. In the latter case, users cannot trust the physiological parameters computed from these data, even if implausible segmentations do not significantly change the parameter values.
References
 [1] M. Salerno, B. Sharif, H. Arheden, A. Kumar, L. Axel, D. Li, and S. Neubauer, “Recent advances in cardiovascular magnetic resonance: Techniques and applications,” Circ. Cardiovasc. Imaging, vol. 10, no. 6, 2017.
 [2] E. Folland, A. Parisi, P. Moynihan, D. Jones, C. Feldman, and D. Tow, “Assessment of left ventricular ejection fraction and volumes by real-time, two-dimensional echocardiography. A comparison of cineangiographic and radionuclide techniques,” Circulation, vol. 60, pp. 760–766, 1979.
 [3] J. Duan, G. Bello, J. Schlemper, W. Bai, T. J. W. Dawes, C. Biffi, A. de Marvao, G. Doumou, D. P. O’Regan, and D. Rueckert, “Automatic 3D biventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach,” IEEE TMI, vol. 38, no. 9, pp. 2151–2164, 2019.
 [4] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G. Ballester, G. Sanroma, S. Napel, S. Petersen, G. Tziritas, G. Ilias, M. Khened, V. A. Kollerathu, G. Krishnamurthi, M.-M. Rohé, and S. Engelhardt, “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?” IEEE TMI, vol. 37, no. 11, pp. 2514–2525, 2018.
 [5] S. Leclerc, E. Smistad, J. Pedrosa, A. Østvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, P.-M. Jodoin, T. Grenier, C. Lartizien, J. D’hooge, L. Lovstakken, and O. Bernard, “Deep convolutional network for 2D echocardiographic segmentation based on an open large-scale patient database,” IEEE TMI, vol. 38, no. 8, pp. 2198–2210, 2019.
 [6] O. Oktay, E. Ferrante, K. Kamnitsas, M. Heinrich, W. Bai, J. Caballero, S. Cook, A. de Marvao, T. Dawes, D. O’Regan, B. Kainz, B. Glocker, and D. Rueckert, “Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation,” IEEE TMI, vol. 37, no. 2, 2017.
 [7] C. Zotti, Z. Luo, A. Lalande, and P.-M. Jodoin, “Convolutional neural network with shape prior applied to cardiac MRI segmentation,” IEEE JBHI, vol. 23, no. 3, pp. 1119–1128, 2019.
 [8] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.
 [9] F. Isensee, P. F. Jaeger, P. M. Full, I. Wolf, S. Engelhardt, and K. H. Maier-Hein, “Automatic cardiac disease assessment on cine-MRI via time-series segmentation and domain specific features,” in STACOM-MICCAI, 2017, pp. 120–129.
 [10] D. M. Vigneault, W. Xie, C. Y. Ho, D. A. Bluemke, and J. A. Noble, “Ω-net (omega-net): Fully automatic, multi-view cardiac MR detection, orientation, and segmentation with deep neural networks,” Medical Image Analysis, vol. 48, pp. 95–106, 2018.
 [11] G. Carneiro, J. C. Nascimento, and A. Freitas, “The segmentation of the left ventricle of the heart from ultrasound data using deep learning architectures and derivative-based search methods,” IEEE T. Image Process., vol. 21, no. 3, pp. 968–982, 2012.
 [12] H. Chen, Y. Zheng, J. H. Park, P.-A. Heng, and S. K. Zhou, “Iterative multi-domain regularized deep learning for anatomical structure detection and segmentation from ultrasound images,” in MICCAI, vol. 9901, 2016, pp. 487–495.
 [13] E. Smistad, A. Østvik, B. O. Haugen, and L. Lovstakken, “2D left ventricle segmentation using deep learning,” in IEEE IUS, 2017, pp. 1–4.
 [14] S. Leclerc, E. Smistad, A. Østvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, T. Grenier, C. Lartizien, P.-M. Jodoin, L. Lovstakken, and O. Bernard, “LU-Net: A multi-task network to improve the robustness of segmentation of left ventricular structures by deep learning in 2D echocardiography,” 2020.
 [15] S. Dong, G. Luo, C. Tam, W. Wang, K. Wang, S. Cao, B. Chen, H. Zhang, and S. Li, “Deep atlas network for efficient 3d left ventricle segmentation on echocardiography,” Medical Image Analysis, vol. 61, p. 101638, 2020.
 [16] M. H. Jafari, Z. Liao, H. Girgis, M. Pesteie, R. Rohling, K. Gin, T. Tsang, and P. Abolmaesumi, “Echocardiography Segmentation by Quality Translation Using Anatomically Constrained CycleGAN,” in MICCAI, 2019.
 [17] D. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2013.
 [18] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner, “beta-VAE: Learning basic visual concepts with a constrained variational framework,” in ICLR, 2017.
 [19] C. M. Bishop, Pattern recognition and machine learning, 5th Ed. Springer, 2007.
 [20] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 [21] D.A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” in ICLR, 2016.
 [22] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
 [23] M. Khened, V. Alex, and G. Krishnamurthi, “Densely connected fully convolutional network for short-axis cardiac cine MR image segmentation and heart diagnosis using random forest,” in STACOM-MICCAI, 2017, pp. 140–151.
 [24] C. F. Baumgartner, L. M. Koch, M. Pollefeys, and E. Konukoglu, “An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation,” in STACOM-MICCAI, 2017, pp. 111–119.
 [25] C. Zotti, Z. Luo, O. Humbert, A. Lalande, and P.-M. Jodoin, “GridNet with automatic shape prior registration for automatic MRI cardiac segmentation,” in STACOM-MICCAI, 2017, pp. 73–81.
 [26] E. Grinias and G. Tziritas, “Fast fully-automatic cardiac segmentation in MRI using MRF model optimization, substructures tracking and B-spline smoothing,” in STACOM-MICCAI, 2017, pp. 91–100.
 [27] M.-M. Rohé, M. Sermesant, and X. Pennec, “Automatic multi-atlas segmentation of myocardium with SVF-Net,” in STACOM-MICCAI, 2017, pp. 170–177.
 [28] J. M. Wolterink, T. Leiner, M. A. Viergever, and I. Išgum, “Automatic segmentation and disease classification using cardiac cine MR images,” in STACOM-MICCAI, 2017, pp. 101–110.
 [29] J. Patravali, S. Jain, and S. Chilamkurthy, “2D-3D fully convolutional neural networks for cardiac MR segmentation,” in STACOM-MICCAI, 2017, pp. 130–139.
 [30] Y. Jang, Y. Hong, S. Ha, S. Kim, and H.-J. Chang, “Automatic segmentation of LV and RV in cardiac MRI,” in STACOM-MICCAI, 2017, pp. 161–169.
 [31] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “ENet: A deep neural network architecture for real-time semantic segmentation,” 2016.
 [32] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in ECCV, 2016.
 [33] P. Dollár and C. L. Zitnick, “Fast edge detection using structured forests,” IEEE T. PAMI, vol. 37, no. 8, pp. 1558–1570, 2015.
 [34] J. Pedrosa, S. Queirós, O. Bernard, J. Engvall, T. Edvardsen, E. Nagel, and J. D’hooge, “Fast and fully automatic left ventricular segmentation and tracking in echocardiography using shape-based B-spline explicit active surfaces,” IEEE TMI, vol. 36, no. 11, pp. 2287–2296, 2017.
 [35] D. Barbosa, T. Dietenbeck, B. Heyde, H. Houle, D. Friboulet, J. D’hooge, and O. Bernard, “Fast and fully automatic 3D echocardiographic segmentation using B-spline explicit active surfaces: Feasibility study and validation in a clinical setting,” Ultrasound in Medicine & Biology, vol. 39, no. 1, pp. 89–101, 2013.