## 1 Introduction

Magnetic Resonance Imaging (MRI) is a non-invasive imaging technique of choice to evaluate the heart. Cardiac function is typically evaluated from a series of kinetic images (cine-MRI) acquired in short-axis orientation [9]. In clinical practice, cardiac parameters are usually estimated from the endocardial and epicardial borders of the left ventricle (which delimit the cavity (LV) and the myocardium (MYO)) and the endocardial border of the right ventricle (RV) in the end-diastolic (ED) and end-systolic (ES) phases. In the last few years, several deep learning segmentation methods (in particular CNNs) have had great success at estimating these clinical parameters [3, 4, 5, 11]. Some of them provide excellent segmentation results, with overall Dice index and/or Hausdorff distance within the inter- and intra-observer variations [4]. Unfortunately, these methods still generate anatomically impossible shapes such as an LV connected to the background or two disconnected RV regions. Therefore, despite their excellent results on average, these methods are still unfit for day-to-day clinical use.

To reduce such errors, several papers integrate shape priors into their cardiac deep learning segmentation methods. In particular, Oktay *et al.* used an approach named anatomically constrained neural network (ACNN) [5]. Their neural network is similar to a 3D U-Net whose segmentation output is constrained to be close to a non-linear compact representation of the underlying anatomy derived from an auto-encoder network. More recently, Zotti *et al.* proposed a method based on the grid-net architecture that embeds a cardiac shape prior to segment MR images [11]. Their shape prior encodes the probability of a 3D location point being a member of a certain class and is automatically registered with the last feature maps of their network. Finally, Duan *et al.* implemented a shape-constrained bi-ventricular segmentation strategy [3]. Their pipeline starts with a multi-task deep learning approach that aims to locate specific landmarks. These landmarks are then used to initialize atlas propagation during a refinement stage of segmentation. Although the use of an atlas significantly improves the quality of the results, their final segmented shapes strongly depend on the accuracy of the located landmarks. From these studies, it appears that only soft constraints are currently imposed in the literature to steer the segmentation outputs towards a reference shape. As will be shown in this paper, shape-prior methods are not immune to producing anatomically incorrect results.

Another simple way of reducing the number of anatomically inaccurate results is through the use of post-processing tools. It typically involves morphological operators or some connected component analysis to remove small isolated regions. Unfortunately, such post-processing methods cannot guarantee the anatomical plausibility of every segmentation map.

In this paper, we present the first deep learning formalism which guarantees the anatomical plausibility of cardiac shapes. Our method can be plugged onto the output of any segmentation method: it reduces the number of anatomically invalid shapes to zero while preserving the overall quality of the results.

## 2 Proposed Framework

As shown in Fig. 1, our method has three main blocks, namely: i) an adversarial VAE that learns a 32-dim latent representation of anatomically correct cardiac shapes, ii) an anatomically-constrained data augmentation of the latent vectors and iii) a post-processing VAE which converts erroneous segmentation maps into anatomically plausible ones. The anatomical guarantees that our method provides come from a transformation function that replaces the latent vector of an anatomically erroneous shape with a close but anatomically correct one.

### 2.1 Cardiac MR Images and Anatomical Metrics

The goal of our method is to produce cardiac segmentation maps with strong anatomical guarantees from short-axis cine-MRI. To that end, we defined 16 anatomical metrics that are used to detect incorrect cardiac shapes.

We first consider as anatomically impossible any hole in the LV, the RV or the MYO, as well as any hole between the LV and the MYO or between the RV and the MYO. The presence of more than one LV, RV or MYO is also considered implausible. We also measure whether the RV is disconnected from the MYO, whether the LV touches the RV or the background, and whether the LV, RV and MYO suffer from unusually acute concavities. The threshold beyond which a concavity is considered abnormal was defined based on the groundtruth of the ACDC training set (cf. Section 3). We also implemented a circularity metric for the LV and the MYO, which is the ratio of their area to that of a circle having the same perimeter. Again, the threshold for that ratio was obtained from the ACDC training set. Please note that since these metrics are not included in the loss, they do not need to be differentiable.
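As an illustration of the circularity metric, note that a circle of perimeter $P$ has area $P^2/(4\pi)$, so the ratio described above reduces to $4\pi A / P^2$: it equals 1 for a perfect circle and decreases for elongated or jagged shapes. The following is a minimal sketch, assuming the area and perimeter of the structure have already been measured on the segmentation map. The function names and the `threshold` parameter are hypothetical and are not taken from our implementation.

```python
import math


def circularity(area: float, perimeter: float) -> float:
    """Ratio of a shape's area to the area of a circle with the same perimeter."""
    if perimeter <= 0:
        raise ValueError("perimeter must be positive")
    # A circle of perimeter P has radius P / (2*pi), hence area P**2 / (4*pi).
    return 4.0 * math.pi * area / perimeter ** 2


def is_circular_enough(area: float, perimeter: float, threshold: float) -> bool:
    """Hypothetical check; the threshold would come from training groundtruth."""
    return circularity(area, perimeter) >= threshold


# Sanity checks on shapes with known geometry.
circle_score = circularity(math.pi, 2.0 * math.pi)  # unit circle
square_score = circularity(1.0, 4.0)                # unit square
```

The unit circle scores 1 and the unit square scores $\pi/4$, which shows how the ratio penalizes departures from a round shape.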

### 2.2 Adversarial Variational Autoencoder (aVAE)

VAEs [7] are encoder/decoder unsupervised learning methods used to derive a latent representation of a set of data. In our case, the encoder takes as input a cardiac segmentation map $\vec{x}$ and outputs the parameters ($\vec{\mu}$ and $\vec{\sigma}$) of a Gaussian probability density $P(\vec{z}|\vec{x})$ where $\vec{z}$ is a latent vector. The decoder takes in a latent variable sampled from $P(\vec{z}|\vec{x})$ and outputs $\hat{\vec{x}}$, a reconstructed version of the input cardiac shape $\vec{x}$.

In our method, we implemented an adversarial VAE (aVAE) [10] which forces the latent space to be as linear as possible. The constraint comes in the form of a single-layer neural network [1] trained simultaneously with the rest of the VAE. This neural network is used to predict the slice index of the input image given its latent vector using a regression loss. Since the regression's gradient signal propagates through the encoder, it forces the encoder to learn a more linear (and thus less convoluted) latent space. As will be shown in Section 3.2, a smoother latent space makes interpolated values more anatomically plausible.

### 2.3 Anatomically-Constrained Data Augmentation

Once the aVAE is trained, every groundtruth short-axis cardiac shape is projected onto the 32d latent space. Since every short-axis map of the ACDC dataset [4] yields one latent vector $\vec{z}_i$, the latent space gets populated by as many latent vectors as there are groundtruth maps. These latent vectors are "anatomically correct" since the deterministic aVAE decoder can convert them back to anatomically valid cardiac shapes. Unfortunately, these vectors are too few to densely populate the 32d manifold of anatomically correct latent vectors.

To solve that problem, we increase the number of anatomically correct latent vectors with a rejection sampling (RS) method [8]. The goal is to produce a new set of latent vectors such that the distribution of the newly generated samples is close to $P(\vec{z})$, the distribution from which the original points derive. RS generates a series of samples iid of $P(\vec{z})$ but based on a second and easier to sample pdf $Q(\vec{z})$ (in our case a zero-centered Gaussian distribution with a variance of 2). A key idea with RS is that $P(\vec{z}) < M\,Q(\vec{z})$ where $M > 1$. Given $P(\vec{z})$ and $Q(\vec{z})$, the sampling procedure first generates a random sample $\vec{z}_j$ iid of $Q(\vec{z})$ as well as a uniform random value $u \in [0, 1]$. If $u < \frac{P(\vec{z}_j)}{M\,Q(\vec{z}_j)}$ then $\vec{z}_j$ is kept, otherwise it is rejected.

In our case, in addition to an increased number of latent vectors, we want those new vectors to correspond to anatomically correct cardiac shapes. As such, we redefine the RS criterion as follows:

$$u < \frac{P(\vec{z}_j)}{M\,Q(\vec{z}_j)}\,\mathbb{1}(\vec{z}_j) \tag{1}$$

where $\mathbb{1}(\vec{z}_j)$ is an indicator function which returns 1 when the decoded latent vector $\vec{z}_j$ is a valid cardiac anatomy and zero otherwise. We call this operation an anatomically-constrained rejection sampling augmentation. This procedure is repeated until the desired number of samples is generated. Since in our case $P(\vec{z})$ is unknown, we estimate it with a Parzen window distribution [1]. This operation allows us to generate 4 million latent vectors which all have a valid cardiac shape, i.e. that respect all 16 metrics defined in Section 2.1. Images of generated samples are provided in the supplementary materials.
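The sampling loop described above can be sketched as follows. This is a minimal, dependency-free illustration rather than our actual implementation. It assumes a Gaussian-kernel Parzen estimate of the target density, an isotropic zero-centered Gaussian proposal with variance 2, and a user-supplied bound `m_bound` large enough that the target density stays below `m_bound` times the proposal density. The names `parzen_pdf`, `constrained_rejection_sampling`, `bandwidth_var` and the toy validity test are hypothetical. In practice the validity test would decode the vector and evaluate the 16 anatomical metrics.

```python
import math
import random


def gaussian_pdf(z, mean, var):
    """Isotropic Gaussian density with mean vector `mean` and scalar variance `var`."""
    dim = len(z)
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, mean))
    return math.exp(-0.5 * sq_dist / var) / (2.0 * math.pi * var) ** (dim / 2.0)


def parzen_pdf(z, seeds, bandwidth_var):
    """Parzen-window density estimate: average of Gaussian kernels centered on seeds."""
    return sum(gaussian_pdf(z, s, bandwidth_var) for s in seeds) / len(seeds)


def constrained_rejection_sampling(seeds, is_valid, n_samples, m_bound,
                                   proposal_var=2.0, bandwidth_var=0.25,
                                   rng=None, max_draws=100_000):
    """Draw up to n_samples vectors that pass both the RS test and is_valid."""
    rng = rng or random.Random()
    dim = len(seeds[0])
    origin = [0.0] * dim
    sigma = math.sqrt(proposal_var)
    kept = []
    for _ in range(max_draws):
        if len(kept) >= n_samples:
            break
        # Candidate from the proposal (zero-centered Gaussian) and a uniform value.
        z = [rng.gauss(0.0, sigma) for _ in range(dim)]
        u = rng.random()
        ratio = parzen_pdf(z, seeds, bandwidth_var) / (
            m_bound * gaussian_pdf(z, origin, proposal_var))
        # Eq. (1): multiplying the ratio by a 0/1 indicator is the same as
        # requiring both the RS test and the validity test to pass.
        if u < ratio and is_valid(z):
            kept.append(z)
    return kept


def toy_is_valid(z):
    """Stand-in for the anatomical check (assumption for this example only)."""
    return z[0] >= 0.0


seeds = [[0.5, 0.0], [-0.5, 0.3], [0.2, -0.4]]
samples = constrained_rejection_sampling(
    seeds, toy_is_valid, n_samples=20, m_bound=10.0, rng=random.Random(0))
```

Evaluating the density ratio before the validity test is a design choice of this sketch: the anatomical check is the expensive step, so it is only run on candidates that already passed the standard RS test.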

### 2.4 Latent Vector Transformation

Our system contains a post-processing VAE (at the bottom right of Fig. 1) used to convert erroneous segmentation maps into anatomically valid segmentations. The post-processing VAE has the same architecture and the same weights as the aVAE. Thus, any erroneous segmentation map fed to the VAE gets projected into the same latent space as that of the aVAE. Furthermore, since the VAE is deterministic, any anatomically valid latent vector is guaranteed to be converted into an anatomically correct cardiac shape.

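The goal is to transform the latent vector $\vec{z}$ of an erroneous cardiac shape into a similar but anatomically valid latent vector $\vec{z}'$. This transformation can be summarized as follows:

$$\vec{z}' = \arg\min_{\vec{z}'} \lVert \vec{z} - \vec{z}' \rVert \quad \text{s.t.} \quad \mathbb{1}(\vec{z}') = 1 \tag{2}$$

In other words, the goal is to find the anatomically valid latent vector $\vec{z}'$ that is the closest to $\vec{z}$. Unfortunately, since the indicator function $\mathbb{1}(\cdot)$ involves the 16 non-differentiable metrics, this function cannot be minimized with a usual Lagrangian formulation. As a solution, we redefined the problem of finding $\vec{z}'$ as a problem of finding the smallest vector $\vec{\delta}$ such that $\vec{z}' = \vec{z} + \vec{\delta}$. In this paper, we recover $\vec{\delta}$ based on the nearest neighbor in the augmented latent space. In this way, $\vec{\delta} = \alpha(\vec{z}_{NN} - \vec{z})$ where $\vec{z}_{NN}$ is the nearest neighbor of $\vec{z}$ in the augmented latent space and $\alpha \in [0, 1]$. This leads to an easier 1D optimization problem:

$$\alpha' = \arg\min_{\alpha} \lVert \alpha(\vec{z}_{NN} - \vec{z}) \rVert \quad \text{s.t.} \quad \mathbb{1}\big(\vec{z} + \alpha(\vec{z}_{NN} - \vec{z})\big) = 1 \tag{3}$$

that we solve with a dichotomic search. At each iteration, the remaining search space of $\alpha$ is divided in two and the anatomical criterion specifies which of the upper half or lower half should be divided at the next iteration. Since the search space decreases exponentially fast, the optimization algorithm is stopped after five iterations.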
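The nearest-neighbor lookup and the dichotomic search can be sketched as below. This is an illustrative sketch under stated assumptions, not our exact code: the blending coefficient is called `alpha` (0 keeps the erroneous vector, 1 returns the nearest neighbor), the neighbor is found by brute-force Euclidean distance, and validity is assumed to switch only once along the segment between the two vectors. The function names and the toy validity test are hypothetical.

```python
import math


def nearest_neighbor(z, candidates):
    """Return the candidate closest to z in Euclidean distance (brute force)."""
    return min(candidates, key=lambda c: math.dist(z, c))


def dichotomic_search(z, z_nn, is_valid, n_iters=5):
    """Find a small alpha such that z + alpha * (z_nn - z) is valid.

    Returns (alpha, blended_vector). The upper bound only moves to values
    that passed is_valid, so the result is valid whenever z_nn is valid.
    """
    def blend(alpha):
        return [a + alpha * (b - a) for a, b in zip(z, z_nn)]

    lo, hi = 0.0, 1.0  # alpha = 1 is the nearest neighbor itself
    for _ in range(n_iters):
        mid = 0.5 * (lo + hi)
        if is_valid(blend(mid)):
            hi = mid  # still valid: try to stay closer to the original vector
        else:
            lo = mid  # invalid: move back toward the nearest neighbor
    return hi, blend(hi)


def toy_is_valid(z):
    """Stand-in for the anatomical check (assumption for this example only)."""
    return z[0] >= 1.0


augmented = [[4.0, 2.0], [-6.0, 0.0], [5.0, 5.0]]
z_bad = [0.0, 0.0]
z_nn = nearest_neighbor(z_bad, augmented)
alpha, z_fixed = dichotomic_search(z_bad, z_nn, toy_is_valid)
```

With five iterations the remaining interval has width $2^{-5}$, so the returned coefficient is within about 0.03 of the smallest valid one under the single-switch assumption.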

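Table 1: Percentage of anatomically implausible results obtained after interpolating two valid latent vectors, for each VAE configuration.

| VAE | VAE + regist | VAE + adv | VAE + adv + regist |
|---|---|---|---|
| 5.84 | 5.85 | 8.48 | 1.25 |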

### 2.5 Implementation Details

The encoder of our aVAE has ten convolution layers with stride 2 and ELU [2] activation layers, which output a 32-dim latent vector. The decoder follows the same architecture except for the transposed convolutions that increase the resolution of the feature maps. For the adversarial network, we used a single-layer neural network with an L2 regression loss. The whole network is trained end-to-end using Adam [6] with weight regularization. Note that the segmentation maps fed to our VAEs share a fixed size and are registered so that the center of the LV is in the middle of the image. This translation and rotation registration is done at runtime.

## 3 Experimental Setup and Results

### 3.1 Dataset, evaluation criteria, and other methods

We trained and tested our method on the 2017 ACDC dataset [4] which contains cine-MR images of 150 patients, 100 for training and 50 for testing. As shown in Fig. 2, the LV, RV and MYO of every patient have been manually segmented. We report the average 3D Dice index and Hausdorff distance (HD) for the LV, RV and MYO as well as the LV and RV ejection fraction (EF) absolute error. Since our approach can accommodate any segmentation method, we tested it on the test results reported by the ten ACDC challengers. Their methods are summarized by Bernard *et al.* [4] except for Zotti-2 [11] whose results have been uploaded recently. We also report results for the ACNN method of Oktay *et al.* [5] that uses a latent anatomical prior to train a segmentation CNN. Results from our best implementation (which involves a U-Net and our VAE) are very close to those of the original paper despite the fact that the ACDC training set is smaller than the one they used. HD values are also slightly larger since we use a 3D HD instead of a 2D HD as in the original paper.
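For reference, the Dice index and the EF absolute error can be computed as in the sketch below. It uses the standard definitions, Dice $= 2|A \cap B| / (|A| + |B|)$ and EF $= 100 \times (V_{ED} - V_{ES}) / V_{ED}$, and is only an illustration: masks are assumed to be flat sequences of 0/1 values, volumes are assumed to be given in a common unit, and the function names are hypothetical.

```python
def dice_index(pred, truth):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def ejection_fraction(ed_volume, es_volume):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    return 100.0 * (ed_volume - es_volume) / ed_volume


def ef_absolute_error(pred_ed, pred_es, true_ed, true_es):
    """Absolute difference between predicted and reference ejection fractions."""
    return abs(ejection_fraction(pred_ed, pred_es)
               - ejection_fraction(true_ed, true_es))


dice = dice_index([1, 1, 0, 0], [1, 0, 1, 0])
ef_error = ef_absolute_error(100.0, 45.0, 100.0, 40.0)
```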

Table 2: Number of slices with at least one anatomical error on the ACDC test set. NN: nearest neighbors.

| Submissions | Original | VAE | NN w/o RS | NN w/ RS | NN Dicho |
|---|---|---|---|---|---|
| Zotti-2 | 55 | 16 | 0 | 0 | 0 |
| Khened | 55 | 16 | 0 | 0 | 0 |
| Baumgartner | 79 | 17 | 0 | 0 | 0 |
| Zotti | 82 | 15 | 0 | 0 | 0 |
| Grinias | 89 | 12 | 0 | 0 | 0 |
| Isensee | 128 | 21 | 0 | 0 | 0 |
| Rohé | 287 | 40 | 0 | 0 | 0 |
| Wolterink | 324 | 42 | 0 | 0 | 0 |
| Jain | 185 | 28 | 0 | 0 | 0 |
| Yang | 572 | 182 | 0 | 0 | 0 |
| ACNN | 139 | 41 | 0 | 0 | 0 |

### 3.2 Experimental Results

#### 3.2.1 Adversarial variational autoencoder

We validated the design of our aVAE through the ablation study of Table 1. Since our post-processing method relies on latent vector interpolation (cf. Eq. (3)), we computed the percentage of anatomically implausible results obtained after interpolating two valid latent vectors. To do so, we iteratively selected the groundtruth of two random slices from two random patients of the ACDC test set, encoded them to the latent space with the aVAE encoder and linearly interpolated 25 new vectors. We then converted these 25 vectors to segmentation maps with the aVAE decoder and computed their percentage of anatomical errors. We repeated that process 500 times for the aVAE with and without registration and with and without an adversarial regression loss. As can be seen, the use of registration and an adversarial regression loss reduces the percentage of anatomically implausible results down to 1.25%, which is more than 4x lower than for the other configurations.
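The interpolation step of this ablation can be sketched as follows. This is an illustration only. It assumes the 25 vectors are evenly spaced strictly between the two encoded endpoints, and it replaces the decoder plus the 16 anatomical metrics with a hypothetical toy validity test. The function names are not from our code.

```python
def interpolate(z_a, z_b, n_points=25):
    """Return n_points vectors evenly spaced strictly between z_a and z_b."""
    vectors = []
    for k in range(1, n_points + 1):
        t = k / (n_points + 1)
        vectors.append([a + t * (b - a) for a, b in zip(z_a, z_b)])
    return vectors


def percent_invalid(vectors, is_valid):
    """Percentage of vectors that fail the validity check."""
    n_bad = sum(1 for v in vectors if not is_valid(v))
    return 100.0 * n_bad / len(vectors)


def toy_is_valid(z):
    """Stand-in for decoding and checking the anatomy (assumption)."""
    return z[0] >= 5.5


path = interpolate([0.0, 0.0], [26.0, 0.0])
error_rate = percent_invalid(path, toy_is_valid)
```

Averaging `percent_invalid` over many random pairs of slices would give one entry of the ablation table for a given VAE configuration.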

Table 3: Overall Dice index / Hausdorff distance (HD) on the ACDC test set. NN: nearest neighbors.

| Submissions | Original | VAE | NN w/o RS | NN w/ RS | NN Dicho |
|---|---|---|---|---|---|
| Zotti-2 | .913/9.7 | .910/10.1 | .899/14.4 | .909/11.0 | .910/10.1 |
| Khened | .915/11.3 | .912/12.3 | .894/15.2 | .909/12.7 | .912/10.9 |
| Baumgartner | .914/10.5 | .911/11.2 | .889/18.2 | .907/12.6 | .910/10.6 |
| Zotti | .910/9.7 | .907/10.9 | .878/19.6 | .903/12.6 | .907/11.0 |
| Grinias | .835/15.9 | .833/19.3 | .752/32.5 | .825/16.9 | .833/15.8 |
| Isensee | .926/9.1 | .923/10.7 | .881/18.4 | .917/11.2 | .923/9.2 |
| Rohé | .891/12.2 | .887/14.6 | .756/32.2 | .874/15.1 | .887/12.8 |
| Wolterink | .907/10.8 | .903/13.0 | .752/32.8 | .887/13.5 | .903/11.0 |
| Jain | .891/12.2 | .886/12.6 | .820/31.9 | .878/14.2 | .886/11.6 |
| Yang | .800/27.5 | .752/21.7 | .455/29.7 | .722/11.5 | .752/10.2 |
| ACNN | .892/12.3 | .886/26.2 | .885/12.0 | .885/12.2 | .889/13.1 |

#### 3.2.2 Postprocessing results

Results on the ACDC test set are in Table 2, Table 3 and Table 4. Table 2 contains the total number of slices with at least one anatomical error, Table 3 shows the overall Dice index and HD, and Table 4 the LV and RV EF absolute errors. Results without our post-processing are under the Original column. As one can see, every method produces a non-negligible number of anatomical errors relative to the total number of slices in the ACDC test set.

By feeding every erroneous segmentation map to our VAE without transforming the latent vector $\vec{z}$, we drastically reduce the number of anatomical errors without affecting the HD, the Dice and the EF too much. This comes as no surprise since the VAE was trained to reproduce groundtruth (and thus anatomically correct) cardiac shapes. However, like any neural network, a basic VAE provides no guarantee on the quality of its output. To completely eliminate erroneous segmentations, we first swap erroneous latent vectors with their nearest neighbor (i.e. by fixing the interpolation coefficient $\alpha$ to 1 in Eq. (3)) using groundtruth data from the ACDC training set (i.e. its short-axis maps) without RS augmentation (w/o RS). While that procedure eliminated every anatomical error and did not change the EF error much, the Dice index and HD suffered considerably. We then tested the same method but with the latent space augmented by 4 million anatomically correct vectors (cf. Section 2.3). This approach (w/ RS) also provides strong anatomical guarantees but better Dice and HD than without RS. The last column shows the results of our complete method, i.e. Eq. (3) optimized with a dichotomic search. While results are all anatomically correct, the EF error, the Dice index and the HD are almost identical to those of the original methods, thus showing that our approach does not degrade the overall results but only warps anatomically incorrect results towards the closest anatomically viable shape. Fig. 2 shows erroneous predictions before and after our post-processing. While the correct areas are barely affected by our method, erroneous sections, big or small, get smoothly warped. Our method takes roughly 1 sec to process a 2D image on a mid-end computer equipped with a Titan X GPU.

Table 4: LV / RV ejection fraction (EF) absolute errors on the ACDC test set. NN: nearest neighbors.

| Submissions | Original | VAE | NN w/o RS | NN w/ RS | NN Dicho |
|---|---|---|---|---|---|
| Zotti-2 | 2.54/5.11 | 2.63/5.12 | 2.49/5.57 | 2.58/5.18 | 2.62/5.18 |
| Khened | 2.39/5.24 | 2.41/4.96 | 2.70/5.36 | 2.63/5.07 | 2.42/5.27 |
| Baumgartner | 2.58/6.00 | 2.62/6.30 | 2.83/6.72 | 2.85/6.48 | 2.64/6.33 |
| Zotti | 2.98/5.48 | 2.98/5.42 | 3.06/5.72 | 3.10/5.71 | 3.06/5.59 |
| Grinias | 4.14/7.39 | 4.18/7.86 | 4.67/8.00 | 4.33/7.35 | 4.01/7.43 |
| Isensee | 2.16/4.85 | 2.15/4.61 | 2.49/5.58 | 2.35/4.48 | 2.20/4.82 |
| Rohé | 2.84/8.18 | 2.95/7.85 | 3.13/8.93 | 3.39/7.97 | 2.91/8.11 |
| Wolterink | 2.75/6.59 | 2.82/6.39 | 3.40/6.93 | 3.48/6.07 | 2.84/6.44 |
| Jain | 4.36/8.49 | 4.35/8.83 | 4.98/9.63 | 4.59/8.69 | 4.40/8.72 |
| Yang | 6.22/15.99 | 6.80/20.56 | 7.57/27.9 | 7.77/22.09 | 9.10/21.76 |
| ACNN | 2.46/3.68 | 2.53/4.09 | 2.51/3.89 | 2.96/3.82 | 2.50/3.71 |

## 4 Conclusion

We presented a post-processing VAE which converts anatomically invalid cardiac shapes into close but correct shapes. Our method relies on 16 anatomical metrics that we use both to detect abnormalities and to populate an aVAE latent space. Since those metrics are not included in the loss, they need not be differentiable. According to the inter- and intra-expert variations reported by Bernard *et al.* [4], methods such as Isensee *et al.*, Zotti-2, Khened and Baumgartner are on average as accurate as an expert. With our post-processing method, they are now also guaranteed to produce anatomically plausible results, a unique and fundamental outcome for future clinical applications.

## References

- [1] C. M. Bishop. Pattern Recognition and Machine Learning, 5th Ed. Springer, 2007.
- [2] D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.
- [3] J. Duan, G. Bello, J. Schlemper, et al. Automatic 3D bi-ventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach. IEEE TMI, PP:1–1, 2019.
- [4] O. Bernard et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE TMI, 37:2514–2525, 2018.
- [5] O. Oktay et al. Anatomically constrained neural networks (ACNNs): Application to cardiac image enhancement and segmentation. IEEE TMI, 37(2), 2017.
- [6] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [7] D. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2013.
- [8] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
- [9] M. Salerno, B. Sharif, H. Arheden, et al. Recent advances in cardiovascular magnetic resonance: Techniques and applications. Circ. Cardiovasc. Imaging, 10(6), 2017.
- [10] A. Makhzani, J. Shlens, N. Jaitly, and I. J. Goodfellow. Adversarial autoencoders. In ICLR, 2016.
- [11] C. Zotti, Z. Luo, A. Lalande, and P.-M. Jodoin. Convolutional neural network with shape prior applied to cardiac MRI segmentation. In press at IEEE JBHI, 2018.
