
Interpretable Anomaly Detection in Echocardiograms with Dynamic Variational Trajectory Models

by Alain Ryser, et al.

We propose a novel anomaly detection method for echocardiogram videos. The introduced method takes advantage of the periodic nature of the heart cycle to learn different variants of a variational latent trajectory model (TVAE). The models are trained on the healthy samples of an in-house dataset of infant echocardiogram videos consisting of multiple chamber views to learn a normative prior of the healthy population. During inference, maximum a posteriori (MAP) based anomaly detection is performed to detect out-of-distribution samples in our dataset. The proposed method reliably identifies severe congenital heart defects, such as Ebstein's Anomaly or Shone-complex. Moreover, it achieves superior performance over MAP-based anomaly detection with standard variational autoencoders on the task of detecting pulmonary hypertension and right ventricular dilation. Finally, we demonstrate that the proposed method provides interpretable explanations of its output through heatmaps which highlight the regions corresponding to anomalous heart structures.


1 Introduction

Congenital heart defects (CHDs) account for about 28% of all congenital defects worldwide (Van Der Linde et al., 2011). CHDs manifest in several different heart diseases with various degrees of frequency and severity, and are usually diagnosed with echocardiography. Echocardiography is one of the most common non-invasive screening tools due to its rapid data acquisition, low cost, portability, and measurement without ionizing radiation. Early screening of heart defects in newborns is crucial to ensure the long-term health of the patient (Buskens et al., 1996; Singh and McGeoch, 2016; Van Velzen et al., 2016). However, due to the subtlety of various heart defects and the inherently noisy nature of echocardiogram video (echo) data, a thorough examination of the heart and the diagnosis of CHD remains a challenging and time-consuming process, raising the need for an automated approach. Still, collecting real-world datasets from large populations to apply state-of-the-art supervised deep learning methods is often infeasible. The reason is that many CHDs like Ebstein’s Anomaly, Shone-complex, or complete atrioventricular septal defect (cAVSD) rarely occur, making the dataset extremely imbalanced. On the other hand, we have access to an abundance of echos from healthy infant hearts generated during standard screening procedures, often performed on infants shortly after birth.

In this work, we introduce a novel anomaly detection method to identify a variety of CHDs. The proposed approach learns a structured normative prior of healthy newborn echos using a periodic variational latent trajectory model. At test time, the method can detect out-of-distribution samples corresponding to CHDs. The advantage of this approach is that the model is trained purely on healthy samples, eliminating the need to collect large numbers of samples of rarely occurring CHDs.

In anomaly detection, we assume that all data is drawn from a space $\mathcal{X}$ with some probability density $p$. Anomalies are then defined to be samples drawn from low-probability regions of $\mathcal{X}$ under $p$. More formally, an anomaly space $\mathcal{A} \subseteq \mathcal{X}$ under density $p$ and anomaly threshold $\tau \geq 0$ is defined by

$$\mathcal{A} = \{ x \in \mathcal{X} \mid p(x) \leq \tau \}.$$

Note that $\tau$ is a task-specific measure, as the definition of anomaly can vary drastically over different problem settings. Consequently, most anomaly detection algorithms assign anomaly scores rather than discriminating between normal and anomalous samples.
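The relation between a density threshold and an anomaly score can be illustrated with a toy example. The following is a minimal sketch, assuming a one-dimensional standard Gaussian as the data density; the function names and the threshold value are hypothetical and chosen only for illustration.

```python
import math

def gaussian_density(x, mean=0.0, std=1.0):
    """Density of a 1-D Gaussian, standing in for the (unknown) data density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def anomaly_score(x):
    """Negative log-density: the lower the density, the higher the score."""
    return -math.log(gaussian_density(x))

def is_anomaly(x, tau):
    """Membership in the anomaly set: the density is at most the threshold tau."""
    return gaussian_density(x) <= tau

samples = [0.1, -0.5, 3.2]
flags = [is_anomaly(x, tau=0.01) for x in samples]   # hard decision, needs tau
scores = [anomaly_score(x) for x in samples]         # ranking, no tau needed
```

Ranking samples by score defers the choice of the threshold to the application, which is why score-based evaluation is common in this setting.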

In this work, we focus on reconstruction-based approaches, which encompass some of the most widespread methods for anomaly detection (Chalapathy and Chawla, 2019; Ruff et al., 2021; Pang et al., 2021). This family of methods aims to learn generative models that can reconstruct normal samples well but decrease in performance for anomalous inputs. A measure that quantifies the reconstruction quality a model achieves on a given sample can then be interpreted as the anomaly score of that sample. The models are commonly trained on healthy samples, and during inference, an anomalous sample is assumed to get projected into the learned normal latent space. This effectively leads to high reconstruction errors, resulting in high anomaly scores. More recently, Chen et al. (2020) proposed a variation of the reconstruction-based approach that allows us to incorporate prior knowledge on anomalies during inference by detecting anomalies using a maximum a posteriori (MAP) based approach. However, this approach requires an estimate of the log-likelihood, which restricts model choice to generative models such as variational autoencoders (VAE; Kingma and Welling, 2013).

Although various generative architectures have been proposed in the literature, little effort has been directed toward echocardiogram videos. One exception is the work of Laumer et al. (2020), where the authors introduced a model that specifically targets the periodicity of heartbeats for ejection fraction prediction and arrhythmia classification. However, the model enforces rather restrictive assumptions on the heart dynamics and is purely deterministic in nature. In contrast, we propose a variational latent trajectory model that overcomes the simplistic assumptions of previous approaches and learns a distribution over dynamic trajectories, enabling the detection of different types of CHDs in echocardiograms using the MAP approach. Furthermore, the proposed algorithm allows us to explain predictions by producing heatmaps that highlight regions corresponding to detected anomalies, which ultimately helps clinicians in building trust in the proposed approach. We provide the code to our models on GitHub.

To summarize, the contributions of this paper are the following:

  1. We propose a novel variational latent trajectory model (TVAE) for reconstruction-based anomaly detection on echocardiogram videos.

  2. We perform extensive evaluation of the proposed method on the challenging task of CHD detection in a real-world dataset.

  3. We complement our predictions with decision heatmaps, which highlight the regions of the echocardiograms corresponding to anomalous heart structures.

2 Related Work

The rapid data acquisition, the high observer variation in their interpretation, and the non-invasive technology have made echocardiography a suitable data modality for an abundance of machine learning algorithms. In recent years, a variety of algorithms for segmentation (Dong et al., 2016; Moradi et al., 2019; Leclerc et al., 2019), view classification (Gao et al., 2017; Vaseli et al., 2019) or disease prediction (Madani et al., 2018; Kwon et al., 2019) have been proposed. However, their performance often relies on the assumption that a large labeled dataset can be collected. This assumption does not hold for rare diseases, where the amount of collected data is often too scarce to train a supervised algorithm. Hence, reconstruction-based anomaly detection algorithms could be used in such a setting, but their application to echocardiography is, to the best of our knowledge, left unexplored.

Previous work on reconstruction-based anomaly detection is often based on generative models, such as autoencoders (AE) (Principi et al., 2017; Chen et al., 2017; Chen and Konukoglu, 2018; Pawlowski et al., 2018) or variational autoencoders (VAE; Kingma and Welling, 2013) (An and Cho, 2015; Park et al., 2018; Xu et al., 2018; Cerri et al., 2019; You et al., 2019). Their application to the medical domain is mostly limited to disease detection in MRI (Baur et al., 2018; Chen and Konukoglu, 2018; Baur et al., 2020; Chen et al., 2020; Baur et al., 2021; Pinaya et al., 2021), where anomalies are often easily detectable as they are clearly defined by regions of tissue that contain lesions. On the other hand, pathologies of CHDs in echos are largely heterogeneous and usually cannot be described by unique structural differences from healthy echos. Identifying them is often challenging, as they can be caused by small perturbations of ventricles (ventricular dilation) or subtle malfunctions like pressure differences between chambers in certain phases of the cardiac cycle (pulmonary hypertension). Detecting certain CHDs thus requires the inclusion of temporal structures in addition to the spatial information leveraged in MRI anomaly detection.

Different extensions to AE/VAE have been proposed to perform reconstruction-based anomaly detection on video data (Xu et al., 2015; Hasan et al., 2016; Yan et al., 2018). However, these methods are often designed for abnormal event detection, where anomalies can arise and disappear throughout the video. In contrast, we are interested in whether a given video represents a healthy or anomalous heart. Another method for video anomaly detection is future frame prediction (Liu et al., 2018), which trains models to predict a video frame from one or more previous ones. During inference, it is assumed that such a model achieves better performance on normal than on anomalous frames. Recently, Yu et al. (2020) proposed a method that combines reconstruction and future frame prediction-based approaches in one framework. Though achieving good performance on videos with varying scenes, future frame prediction does not seem suitable for echos, as returning any input frame will always lead to good prediction scores due to the periodic nature of the cardiac cycle.

An entirely different approach to anomaly detection is given by One-Class Classification (Moya and Hush, 1996). In contrast to the previous approaches, it relies on discriminating anomalies from normal samples instead of assigning an anomaly score. This is usually achieved by learning a high-dimensional manifold that encloses normal data. The surface of this manifold then serves as a decision boundary that discriminates anomalies from normal samples. One of the more prominent methods of that family is the so-called Support Vector Data Description (SVDD) (Tax and Duin, 2004) model. The SVDD learns parameters of a hypersphere that encloses the training data. Similar to SVMs, it provides a way to introduce some slack into the estimation process, allowing certain normal samples to lie outside the decision boundary. A similar approach is given by One-Class SVMs (OC-SVM) (Schölkopf et al., 2001), where anomalies are discriminated from normal samples by learning a hyperplane instead of a hypersphere. As with SVMs, the expressivity of SVDD and OC-SVM can be drastically improved by introducing kernelized versions (Ratsch et al., 2002; Ghasemi et al., 2012; Dufrenois, 2014; Gautam et al., 2019). More recently, deep neural networks have been proposed to perform anomaly detection based on similar principles (Ruff et al., 2018; Sabokrou et al., 2018; Ruff et al., 2020; Ghafoori and Leckie, 2020). While conceptually interesting, One-Class Classification methods often require large amounts of data to work accurately, making them unsuitable in many clinical applications.

3 Methods

In this work, we propose a probabilistic latent trajectory model to perform reconstruction-based anomaly detection on echocardiogram videos. We take inspiration from latent trajectory models (Louis et al., 2019; Laumer et al., 2020) and introduce the trajectory variational autoencoder (TVAE), which learns a structured normative distribution of the heart’s shape and dynamic. In particular, the model encodes the echos into stochastic trajectories in the latent space of a VAE, enabling us to accurately generate high-quality reconstructions while maintaining a low dimensional latent bottleneck. We present three different TVAE variants. The TVAE-C and TVAE-R leverage trajectories that assume strict periodic movements of the heart, while TVAE-S is more general and allows shifts in the spatial representation throughout the video, improving the quality of the normative prior. The learned approximate distribution of healthy hearts then allows us to detect anomalies post-hoc using a maximum a posteriori (MAP) approach (Chen et al., 2020). High-quality normative reconstructions and informative latent representations are essential to detect out-of-distribution echos correctly.

3.1 Latent Trajectory Model

Figure 1: Overview of the model architecture with the circular (left), rotated (middle), and spiral (right) trajectories.

The latent trajectory model (Laumer et al., 2020) is an autoencoder that is designed to learn latent representations from periodic sequences of the heart, i.e., echos in this case. The main idea is to capture the periodic nature of the observed data by learning an encoder that maps an echo, given as a sequence of frames with their time points, to a prototypical function whose parameters contain information about the heart’s shape and dynamic. The decoder reconstructs the original video frame by frame from the latent embedding obtained by evaluating this function at the time point of each frame. Here, the prototypical function is a cyclic trajectory in which only the first two latent dimensions vary with time.

The frequency parameter of this trajectory corresponds to the number of cycles per time unit, and the offset parameter allows the sequence to start at an arbitrary point within the (cardiac) cycle. A further parameter characterizes the spatial information of the signal. This model thus describes a simple tool to learn the disentanglement of temporal components (frequency and offset) from a common spatial representation for a given echo. On the other hand, the assumptions made may be too simplistic to result in good reconstructions. We will address this issue in the following sections.
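A cyclic trajectory of this kind can be sketched in a few lines. The following is a minimal illustration, assuming that the first two latent dimensions trace a unit circle shifted by the spatial parameter and that the remaining dimensions are constant in time; the names `f`, `omega`, and `b`, and the sign convention of the offset, are assumptions made for this sketch.

```python
import math

def circular_trajectory(t, f, omega, b):
    """Latent point at time t.

    f     -- frequency (cycles per time unit)
    omega -- phase offset within the cycle (assumed sign convention)
    b     -- spatial parameter, one entry per latent dimension (at least two)
    """
    phase = 2.0 * math.pi * f * t - omega
    z = list(b)                 # dimensions 3..d stay constant over time
    z[0] += math.cos(phase)     # only the first two dimensions move
    z[1] += math.sin(phase)
    return z

b = [0.5, -0.2, 1.0, 0.3]
start = circular_trajectory(0.0, f=2.0, omega=0.0, b=b)
one_cycle_later = circular_trajectory(0.5, f=2.0, omega=0.0, b=b)  # period = 1 / f
```

Evaluating the function one period later returns (up to floating-point error) the same latent point, which is the periodicity the decoder can exploit.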

3.2 Dynamic Trajectories

The above formulation allows modeling time-related information only through the first two latent dimensions, thereby limiting the amount of time-dependent information that can be encoded in the latent space. The reduced flexibility results in insufficient reconstruction quality, impairing the reconstruction-based anomaly detection performance. To circumvent this problem, we distribute time-dependent components over each dimension of the latent space while retaining the periodicity. We refer to the resulting function as the rotated trajectory function.

Furthermore, in real-world applications, it is often the case that doctors change certain settings of the echocardiogram machine during screening to get better views of certain cardiac structures. Additionally, some patients might slightly move while scans are performed, which leads to a displacement of the heart with respect to the transducer position throughout an echo recording. This is particularly prominent in our in-house dataset, which consists of echocardiograms of newborn children. Such echocardiograms are not necessarily well represented with a simple periodic trajectory. Over multiple cycles, the spatial structure of a sample shifts and looks different than in the beginning, even though temporal information like the frequency or phase shift is preserved. The periodic trajectory models described above thus fail in such scenarios, which can manifest in two different ways: either the model converges to a local optimum with high reconstruction error, or the model reconstructs the video from one long cycle, hence not leveraging the heart cycle periodicity. Thus, to account for movements of the recording device, we extend the trajectory with a velocity parameter that allows the model to learn gradual shifts of the latent trajectory over time, resulting in a trajectory that is no longer circular but a spiral embedded in high-dimensional space. We refer to the resulting function as the spiral trajectory function.
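The two extensions can be sketched as follows. This is an illustration under stated assumptions rather than the exact definitions: the rotated trajectory is assumed to alternate `cos - sin` and `cos + sin` components across the latent dimensions, and the spiral trajectory is assumed to add a linear drift `t * v` per dimension. The names `f`, `omega`, `b`, and `v` are hypothetical.

```python
import math

def rotated_trajectory(t, f, omega, b):
    """Periodic trajectory with a time-dependent component in every dimension."""
    phase = 2.0 * math.pi * f * t - omega
    c, s = math.cos(phase), math.sin(phase)
    # Assumed pattern: even dimensions use cos - sin, odd dimensions cos + sin.
    return [(c - s if i % 2 == 0 else c + s) + b_i for i, b_i in enumerate(b)]

def spiral_trajectory(t, f, omega, v, b):
    """Rotated trajectory plus a gradual shift with velocity v per dimension."""
    rotated = rotated_trajectory(t, f, omega, b)
    return [z_i + t * v_i for z_i, v_i in zip(rotated, v)]

b = [0.0, 0.0, 0.0]
v = [0.1, 0.2, 0.3]
rot_start = rotated_trajectory(0.0, f=1.0, omega=0.0, b=b)
rot_period = rotated_trajectory(1.0, f=1.0, omega=0.0, b=b)
spiral_start = spiral_trajectory(0.0, f=1.0, omega=0.0, v=v, b=b)
spiral_period = spiral_trajectory(1.0, f=1.0, omega=0.0, v=v, b=b)
drift = [after - before for after, before in zip(spiral_period, spiral_start)]
```

After one full cycle the rotated trajectory returns to its starting point, while the spiral trajectory has moved by exactly the velocity vector, which is how a slow displacement of the heart relative to the transducer could be absorbed without breaking the periodic component.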

3.2.1 Variational Formulation

Previous work often applied VAEs to anomaly detection, as their generative nature enables more sophisticated variants of reconstruction-based anomaly detection (Baur et al., 2018; Xu et al., 2018; Chen et al., 2020). Thus, we extend the presented model with a stochastic layer and introduce the variational latent trajectory model.

We modify the encoder such that, in addition to the trajectory parameters, it outputs the parameters of a distribution over the spatial representation. The model is then extended with a stochastic layer by sampling the spatial representation from this distribution. While we aim to learn a distribution over heart shapes, we would also like to accurately identify the frequency, phase shift, and spatial shift given an echo video, instead of sampling them from a latent distribution. We thus leave those parameters deterministic. Next, we define an isotropic Gaussian prior on the spatial representation and assume a likelihood over the frames that is parameterized by our decoder, with its weights, and by some fixed constant.

Given these assumptions, we are able to derive an evidence lower bound (ELBO) that depends on the trajectory parameters output by the encoder. Note that VAEs on the circular and rotated trajectories are defined in a similar fashion. A derivation of this ELBO can be found in Appendix A.
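The structure of such a bound can be made concrete with a small numerical sketch. It assumes a diagonal Gaussian posterior over the spatial representation, a standard Gaussian prior, and a Gaussian frame likelihood with fixed standard deviation `sigma_x`, and it uses a single-sample estimate of the expectation. All function and parameter names are hypothetical, and frames are flattened to lists of pixel values.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - 2.0 * math.log(s)
                     for m, s in zip(mu, sigma))

def gaussian_log_likelihood(frames, recon, sigma_x):
    """Sum over all frames and pixels of log N(x | x_hat, sigma_x^2)."""
    squared_error, n = 0.0, 0
    for frame, frame_hat in zip(frames, recon):
        for x, x_hat in zip(frame, frame_hat):
            squared_error += (x - x_hat) ** 2
            n += 1
    return (-0.5 * squared_error / sigma_x ** 2
            - n * math.log(sigma_x * math.sqrt(2.0 * math.pi)))

def elbo(frames, recon, mu, sigma, sigma_x=1.0):
    """Reconstruction term minus KL term (single-sample estimate)."""
    return (gaussian_log_likelihood(frames, recon, sigma_x)
            - kl_to_standard_normal(mu, sigma))

frames = [[0.0, 1.0], [1.0, 0.0]]            # two frames, two pixels each
blurry = [[0.5, 0.5], [0.5, 0.5]]            # a poor reconstruction
elbo_perfect = elbo(frames, frames, mu=[0.0, 0.0], sigma=[1.0, 1.0])
elbo_blurry = elbo(frames, blurry, mu=[0.0, 0.0], sigma=[1.0, 1.0])
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])
```

With a fixed `sigma_x`, the reconstruction term reduces to a scaled squared error plus a constant, so maximizing the bound trades reconstruction quality against the distance of the posterior from the prior.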

3.2.2 Anomaly detection

The variational formulation of the latent trajectory model allows us to perform anomaly detection by Maximum a Posteriori (MAP) inference as proposed in Chen et al. (2020). The authors suggest that anomalies can be modeled as an additive perturbation of a healthy sample. Following their reasoning, we model an observed echo, drawn from the overall data distribution (i.e., including anomalies), as the sum of a sample from the healthy data distribution and an anomalous perturbation.

In the case of CHD, the perturbation could, e.g., remove walls between heart chambers or produce holes in the myocardium for specific frames. The anomaly score can then be defined in terms of this perturbation. When training on healthy samples only, i.e., with no perturbation present, the variational latent trajectory model learns to approximate the healthy data distribution by maximizing the ELBO. We then use MAP estimation to approximate the posterior distribution of the healthy sample given an observed sample. By Bayes’ theorem, this posterior is proportional to the product of the likelihood of the observed sample given the healthy sample and the prior probability of the healthy sample. We can therefore estimate the healthy sample by maximizing the sum of the log-likelihood and the ELBO, where we use the concavity of the logarithm and the fact that the ELBO is a lower bound on the log-probability of the healthy sample. Next, we compute the estimated perturbation as the difference between the observed sample and the estimated healthy sample, and calculate the anomaly score from it. Similar to Chen et al. (2020), we choose a likelihood based on the Total Variation (TV) norm of the perturbation, as this leverages the assumption that anomalies should consist of contiguous regions rather than single pixel perturbations. Note that since we have a temporal model, we can incorporate temporal gradients into the TV norm in addition to the spatial gradients. In our experiments, we use a discrete approximation of these gradients.
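A spatio-temporal TV norm of this kind can be sketched as below. The sketch assumes an anisotropic (L1) TV norm with forward differences between neighbouring frames, rows, and columns; the actual gradient approximation may differ, and the function name is hypothetical. Videos are nested lists indexed as `[frame][row][column]`.

```python
def total_variation(video):
    """Sum of absolute forward differences along time, height, and width."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    tv = 0.0
    for t in range(T):
        for i in range(H):
            for j in range(W):
                x = video[t][i][j]
                if t + 1 < T:                      # temporal gradient
                    tv += abs(video[t + 1][i][j] - x)
                if i + 1 < H:                      # vertical gradient
                    tv += abs(video[t][i + 1][j] - x)
                if j + 1 < W:                      # horizontal gradient
                    tv += abs(video[t][i][j + 1] - x)
    return tv

# Four perturbed pixels forming one contiguous block ...
block = [[[1.0, 1.0, 0.0, 0.0],
          [1.0, 1.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0]]]
# ... versus four isolated perturbed pixels.
scattered = [[[1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]]
tv_block = total_variation(block)
tv_scattered = total_variation(scattered)
tv_flicker = total_variation([[[0.0]], [[1.0]]])   # change over time only
```

Both perturbations contain the same number of non-zero pixels, but the contiguous block has the smaller TV norm, so a TV-based likelihood favours contiguous anomalous regions over isolated pixels.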

4 Experiments

All experiments are conducted on a novel in-house dataset of echocardiograms of newborns. We perform three separate anomaly detection tasks, namely detecting severe structural defects (SSD), right ventricular dilation (RVDil), and pulmonary hypertension (PH). For each task, we define samples that do not contain the respective lesion as part of the normal distribution. We perform all anomaly detection tasks on both the apical four-chamber (4CV) and the parasternal long-axis (PLAX) view. Videos were preprocessed and resampled to a fixed number of frames. More details on the collected dataset and preprocessing can be found in Appendix B.

In addition to the variational latent trajectory models with the circular (TVAE-C), rotated (TVAE-R), and spiral (TVAE-S) trajectories described in Sections 3.1 and 3.2, we train a standard variational autoencoder (Kingma and Welling, 2013) on the individual video frames of the dataset as a baseline. We present an outline of the model architecture in Figure 1 and refer to Appendix C for a more detailed description of the network.

We run experiments by training the models exclusively on samples that do not contain the corresponding CHD to learn the normative prior. Each experiment is trained on separate data splits, where we hold out the respective anomalous samples together with a set of healthy samples (sized separately for PH and RVDil on the one hand and for SSD on the other) for evaluation at test time.

4.1 Reconstruction

Figure 2: Examples of healthy (a) and SSD (b) samples (first and third rows) and their reconstructions (second and fourth rows) using the TVAE-S model. The displayed frames are sampled from each echo’s full sequence.

Table 1: Apical 4-chamber view reconstruction performance on test data of the proposed approaches (TVAE-C, TVAE-R and TVAE-S) compared with the baseline (VAE). Means and standard deviations are computed across validation splits.

The reconstruction quality is directly related to reconstruction-based anomaly detection performance, as we rely on the manifold and prototype assumptions formalized in Ruff et al. (2021). The manifold assumption is used in many machine learning-based applications and states that the space of healthy echos can be generated from some latent space by a decoding function and that it is possible to learn a function that encodes the space of healthy echos into this latent space. The better the composition of encoder and decoder reconstructs healthy echos on a test set, the better we match the manifold assumption. On the other hand, the prototype assumption states that there is some set of prototypes that characterizes the healthy distribution well. In our case, the prototypes would be echos corresponding to healthy hearts, i.e., a subset of the space of healthy echos. Under the prototype assumption, our model must be able to assign a given sample to one of the learned prototypes, i.e., project anomalies to the closest healthy echo.

Figure 3: Projection of 4CV view anomalous echo (top) to healthy prototype (bottom). Projections of right (R) and left (L) ventricle (V) and atrium (A) are highlighted in color. The reconstruction of SSD samples approximates a healthy version of the input, e.g., by normalizing the scale of the right and left ventricles (left), adding the ventricular septum (middle), or fixing the location of the valves (right).

Table 1 contains the scores of the VAE, TVAE-C, TVAE-R, and TVAE-S. We report the Mean Squared Error (MSE), Peak Signal to Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). We observe that TVAE-C has consistently worse MSE, PSNR, and SSIM scores than both TVAE-R and TVAE-S. Upon inspection of the reconstructed test videos we notice that, for most seeds, TVAE-C converges to a local optimum where the model learns mean representations of the input videos, thus ignoring the latent dimensions containing temporal information, as described in Section 3. On the other hand, we did not observe this behavior in TVAE-R and TVAE-S, suggesting that these models indeed capture dynamic properties of echos through the learned latent representations. Additionally, TVAE-S achieves good echo reconstructions even for samples with transducer position displacement, improving upon TVAE-R and achieving similar performance as the VAE despite having a smaller information bottleneck. The proposed approaches, TVAE-C, TVAE-R, and TVAE-S, encode each video into a single set of trajectory parameters, while the VAE encodes each individual frame separately, resulting in a considerably larger total number of latent parameters. In conclusion, TVAE-S and the standard VAE fulfill the manifold assumption similarly well. Figure 2 presents reconstructed healthy and SSD samples for the 4CV and PLAX echo views.

In Figure 3, we qualitatively demonstrate that TVAE satisfies the prototype assumption. We observe how the perturbed septum and enlarged/shrunken heart chambers of SSD anomalies are projected to healthy echo reconstructions.

Appendix D provides additional reconstructions and a comprehensive performance comparison of the deterministic and variational models for the 4CV and PLAX echo views.

4.2 Anomaly Detection

Table 2: Anomaly detection performance in terms of area under the curve and average precision of the proposed approaches (TVAE-C, TVAE-R, and TVAE-S) compared with the baseline (VAE) on the four-chamber view and long-axis view for the three different CHD labels. Means and standard deviations are computed on the test sets across data splits. The anomalous echos are considered as the positive class. The AP score of a random classifier corresponds to the fraction of anomalous samples in the respective task (SSD, RVDil, or PH).

As described in Section 3.2.2, we detect anomalies by MAP estimation, i.e., by maximizing the sum of the log-likelihood of the observed sample and the ELBO of the estimated healthy sample. Due to the reconstruction loss in the ELBO, this optimization problem requires us to backpropagate through the whole model in every step. As a result, inference with the standard MAP formulation is inefficient and proved infeasible for our experiments. To circumvent this problem, we assumed the reconstruction part of the ELBO to be constant and solely balanced the posterior with the KL-Divergence of the encoded sample, i.e., how well the sample is mapped to a standard Gaussian. Solving this optimization procedure results in only backpropagating through the encoder instead of the whole model, which leads to a significant speedup.
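The two objectives can be written compactly as follows. This is a sketch in assumed notation rather than the exact formulas: $Y$ denotes the observed echo, $X$ the healthy estimate being optimized, $b$ the latent spatial representation with encoder posterior $q(b \mid X)$ and standard Gaussian prior $p(b)$, and $\lambda$ a weight on the TV norm.

```latex
% Standard MAP objective: evaluating ELBO(X) involves the decoder,
% so every step backpropagates through encoder and decoder.
\tilde{X} = \arg\max_{X} \; -\lambda \, \lVert Y - X \rVert_{TV} + \mathrm{ELBO}(X)

% Simplified objective: the reconstruction term of the ELBO is treated
% as constant, so only the encoder is needed to evaluate the KL term.
\tilde{X} \approx \arg\max_{X} \; -\lambda \, \lVert Y - X \rVert_{TV}
    - \mathrm{KL}\big( q(b \mid X) \,\|\, p(b) \big)
```

In the simplified form, the TV term keeps the estimate close to the observation up to contiguous perturbations, while the KL term pulls its encoding toward the prior learned from healthy samples.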

To optimize this objective, we initialize the estimate with the reconstruction of the input computed by the respective model. We then solve the inference problem with the Adam optimizer, using a fixed learning rate and a fixed number of optimizer steps per sample. Additionally, we weight the TV norm with a constant factor. For each sample, we define the anomaly score as described in Section 3.2.2. Anomaly detection performance is then evaluated in terms of the Area Under the Receiver Operating Characteristic curve (AUROC) and Average Precision (AP) when considering the anomalies as the positive class. In Table 2, we provide a complete overview of the results of the anomaly detection experiments over both views.
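Both evaluation metrics can be computed directly from anomaly scores and binary labels. The following is a minimal pure-Python sketch with anomalies as the positive class (label 1); the scores and labels are made-up toy values, and tied scores are handled only in the AUROC computation.

```python
def auroc(scores, labels):
    """Probability that a random anomaly scores higher than a random
    normal sample; ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each anomaly."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1]   # hypothetical anomaly scores
toy_labels = [1, 0, 1, 0, 0]             # 1 = anomalous, 0 = healthy
toy_auroc = auroc(toy_scores, toy_labels)
toy_ap = average_precision(toy_scores, toy_labels)
```

Unlike AUROC, the AP of a random ranking depends on the share of anomalies in the test set, so AP values are only comparable between tasks with similar class balance.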

We observe that the proposed approaches outperform the VAE in all experiments. Especially when detecting SSD anomalies, our models TVAE-C, TVAE-R, and TVAE-S have significantly better performance than the standard VAE. We also note that, despite outperforming TVAE-C and TVAE-R in terms of reconstruction quality, TVAE-S does not always perform better in the anomaly detection task. We explain the score discrepancies between SSD and RVDil/PH by the fact that SSDs deviate considerably from the healthy distribution. On the contrary, RVDil and PH are more subtle and require expert knowledge and several echocardiogram views to be detected in practice.

Additionally, even though the TVAE variations have considerably fewer latent parameters than the VAE, they achieve similar reconstruction quality, as demonstrated in Section 4.1. In the case of the VAE, the larger number of latent parameters gives the optimizer more flexibility when solving the MAP problem, since the frames of the estimate can be updated independently so that they are encoded onto Gaussian parameters close to the standard Gaussian prior, which may result in overfitting during MAP estimation.

Another reconstruction-based inference approach, in which we simply define the anomaly score via the MSE between the input and its reconstruction, is presented in Appendix E.

4.3 Decision Heatmaps

Figure 4: Anomaly response maps of TVAE-R and TVAE-S for healthy samples (a) and echos with CHDs (b). Note how healthy heatmaps are mostly constant, while anomalous maps contain regions with high responses in anomalous regions, corresponding to enlarged ventricles (first/second) or perturbed septa (third/fourth).

In this experiment, we present how the estimated anomaly perturbation can be applied to highlight anomalous regions. Intuitively, anomalous regions of input echos differ more substantially from their healthy projection than healthy regions. Consequently, this leads to higher magnitude values in the corresponding locations in the frames of the estimated perturbation. In turn, we are able to compute an anomaly heatmap by temporally averaging the estimated anomaly perturbation. In Figure 4, we present examples of such maps for each TVAE variation. We observe that TVAE not only has consistently low magnitude responses for healthy echos, but regions corresponding to, e.g., enlarged chambers, are well highlighted in the echos with CHDs. These heatmaps provide TVAE with an additional layer of interpretability and could foster the integration of the proposed algorithm in clinical settings, as the reason for the decisions made by TVAE can easily be interpreted by clinicians. This helps practitioners build trust in the model’s decisions and provides a more intuitive explanation of its outputs. More examples of decision heatmaps are provided in Appendix F.
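The temporal averaging step can be sketched as follows. The sketch assumes that the absolute value of the perturbation is averaged over frames, so that positive and negative deviations do not cancel; whether magnitudes or signed values are averaged is an assumption here, and the function name is hypothetical. The perturbation is a nested list indexed as `[frame][row][column]`.

```python
def anomaly_heatmap(perturbation):
    """Average the absolute perturbation over time, giving one H x W map."""
    T = len(perturbation)
    H, W = len(perturbation[0]), len(perturbation[0][0])
    return [[sum(abs(perturbation[t][i][j]) for t in range(T)) / T
             for j in range(W)]
            for i in range(H)]

# Toy perturbation: two 2 x 2 frames with deviations in one pixel only.
toy_perturbation = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[0.0, 0.0], [0.0, -3.0]],
]
toy_heatmap = anomaly_heatmap(toy_perturbation)
```

The resulting map is zero wherever the observed echo matches its healthy projection in every frame and large where deviations persist over time.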

5 Discussion

In this work, we introduce the TVAE, a new generative model designed explicitly for echocardiogram data. We propose three variants of the model: the TVAE-C and TVAE-R, which make strong assumptions about the periodicity of the data, and the TVAE-S, which can handle more dynamic inputs. Throughout this work, we compared the proposed approach to the VAE in terms of its reconstruction performance and anomaly detection capabilities on a new in-house echo dataset consisting of two different echo views of healthy patients and patients suffering from various CHDs. In exhaustive experiments, we demonstrated how TVAE can achieve reconstruction quality comparable to the VAE while having a significantly smaller information bottleneck. Additionally, we verified that the proposed model can project out-of-distribution samples, i.e., patients suffering from CHD, into the subspace of healthy echos when learning normative priors and concluded that TVAE fulfills crucial assumptions for reconstruction-based anomaly detection. Consequently, we evaluated the CHD detection performance of our model, where we showed that it leads to a considerable improvement over a frame-wise VAE with MAP-based anomaly detection. Furthermore, we demonstrated how TVAE can separate SSD anomalies almost perfectly from healthy echos. Finally, we presented the ability of this model to not only detect but also localize anomalies with heatmaps generated from the MAP output, which could help clinicians with the diagnosis of CHDs.

Limitations and Future Work

Even though we observe convincing results for SSD, performance for the detection of RVDil and PH is still insufficient for clinical application. This is not unexpected, given that these defects are rather subtle and that our in-house dataset is relatively small. It would thus be interesting to apply the proposed approach to different and larger cohorts. In the future, we plan to collect more samples for our in-house dataset. With a more extensive dataset at hand, we look forward to exploring methods that would allow combinations of TVAE with one-class classification or future frame prediction methods to achieve more robust anomaly detection in echocardiography-based disease detection.

The spiral trajectory of the TVAE-S model assumes continuous movement over the video and might thus still be limiting in situations where sudden movement occurs. Investigating accelerating trajectories could thus be an interesting direction. Further, we want to extend the TVAE to multiple modalities such that it is possible to train a model that learns a coherent latent trajectory of multiple echo views of the same heart. As future work, we are also interested in extending the TVAE to different types of medical modalities by designing trajectory functions that leverage modality-specific characteristics.


  • J. An and S. Cho (2015) Variational autoencoder based anomaly detection using reconstruction probability. Special Lecture on IE 2 (1), pp. 1–18. Cited by: §2.
  • C. Baur, R. Graf, B. Wiestler, S. Albarqouni, and N. Navab (2020) SteGANomaly: inhibiting cyclegan steganography for unsupervised anomaly detection in brain mri. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 718–727. Cited by: §2.
  • C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2018) Deep autoencoding models for unsupervised anomaly segmentation in brain mr images. In International MICCAI brainlesion workshop, pp. 161–169. Cited by: §2, §3.2.1.
  • C. Baur, B. Wiestler, M. Muehlau, C. Zimmer, N. Navab, and S. Albarqouni (2021) Modeling healthy anatomy with artificial intelligence for unsupervised anomaly detection in brain mri. Radiology: Artificial Intelligence 3 (3), pp. e190169. Cited by: §2.
  • E. Buskens, P. Stewart, J. Hess, D. Grobbee, and J. Wladimiroff (1996) Efficacy of fetal echocardiography and yield by risk category. Obstetrics & Gynecology 87 (3), pp. 423–428. Cited by: §1.
  • O. Cerri, T. Q. Nguyen, M. Pierini, M. Spiropulu, and J. Vlimant (2019) Variational autoencoders for new physics mining at the large hadron collider. Journal of High Energy Physics 2019 (5), pp. 1–29. Cited by: §2.
  • R. Chalapathy and S. Chawla (2019) Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407. Cited by: §1.
  • J. Chen, S. Sathe, C. Aggarwal, and D. Turaga (2017) Outlier detection with autoencoder ensembles. In Proceedings of the 2017 SIAM international conference on data mining, pp. 90–98. Cited by: §2.
  • X. Chen and E. Konukoglu (2018) Unsupervised detection of lesions in brain mri using constrained adversarial auto-encoders. arXiv preprint arXiv:1806.04972. Cited by: §2.
  • X. Chen, S. You, K. C. Tezcan, and E. Konukoglu (2020) Unsupervised lesion detection via image restoration with a normative prior. Medical image analysis 64, pp. 101713. Cited by: §1, §2, §3.2.1, §3.2.2, §3.
  • S. Dong, G. Luo, G. Sun, K. Wang, and H. Zhang (2016) A left ventricular segmentation method on 3d echocardiography using deep learning and snake. In 2016 Computing in Cardiology Conference (CinC), pp. 473–476. Cited by: §2.
  • F. Dufrenois (2014) A one-class kernel fisher criterion for outlier detection. IEEE transactions on neural networks and learning systems 26 (5), pp. 982–994. Cited by: §2.
  • X. Gao, W. Li, M. Loomes, and L. Wang (2017) A fused deep learning architecture for viewpoint classification of echocardiography. Information Fusion 36, pp. 103–113. Cited by: §2.
  • C. Gautam, R. Balaji, K. Sudharsan, A. Tiwari, and K. Ahuja (2019) Localized multiple kernel learning for anomaly detection: one-class classification. Knowledge-Based Systems 165, pp. 241–252. Cited by: §2.
  • Z. Ghafoori and C. Leckie (2020) Deep multi-sphere support vector data description. In Proceedings of the 2020 SIAM International Conference on Data Mining, pp. 109–117. Cited by: §2.
  • A. Ghasemi, H. R. Rabiee, M. T. Manzuri, and M. H. Rohban (2012) A bayesian approach to the data description problem. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 26, pp. 907–913. Cited by: §2.
  • M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis (2016) Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 733–742. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: Appendix C.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1, §2, §4.
  • J. Kwon, K. Kim, K. Jeon, and J. Park (2019) Deep learning for predicting in-hospital mortality among heart disease patients based on echocardiography. Echocardiography 36 (2), pp. 213–218. Cited by: §2.
  • F. Laumer, G. Fringeli, A. Dubatovka, L. Manduchi, and J. M. Buhmann (2020) DeepHeartBeat: latent trajectory learning of cardiac cycles using cardiac ultrasounds. In Machine Learning for Health, pp. 194–212. Cited by: §1, §3.1, §3.
  • S. Leclerc, E. Smistad, J. Pedrosa, A. Østvik, F. Cervenansky, F. Espinosa, T. Espeland, E. A. R. Berg, P. Jodoin, T. Grenier, et al. (2019) Deep learning for segmentation using an open large-scale dataset in 2d echocardiography. IEEE transactions on medical imaging 38 (9), pp. 2198–2210. Cited by: §2.
  • W. Liu, W. Luo, D. Lian, and S. Gao (2018) Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6536–6545. Cited by: §2.
  • M. Louis, R. Couronné, I. Koval, B. Charlier, and S. Durrleman (2019) Riemannian Geometry Learning for Disease Progression Modelling. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 11492 LNCS, pp. 542–553. Cited by: §3.
  • A. Madani, J. R. Ong, A. Tibrewal, and M. R. Mofrad (2018) Deep echocardiography: data-efficient supervised and semi-supervised deep learning towards automated diagnosis of cardiac disease. NPJ digital medicine 1 (1), pp. 1–11. Cited by: §2.
  • S. Moradi, M. G. Oghli, A. Alizadehasl, I. Shiri, N. Oveisi, M. Oveisi, M. Maleki, and J. Dhooge (2019) MFP-unet: a novel deep learning based approach for left ventricle segmentation in echocardiography. Physica Medica 67, pp. 58–69. Cited by: §2.
  • M. M. Moya and D. R. Hush (1996) Network constraints and multi-objective optimization for one-class classification. Neural networks 9 (3), pp. 463–474. Cited by: §2.
  • D. Ouyang, B. He, A. Ghorbani, N. Yuan, J. Ebinger, C. P. Langlotz, P. A. Heidenreich, R. A. Harrington, D. H. Liang, E. A. Ashley, et al. (2020) Video-based ai for beat-to-beat assessment of cardiac function. Nature 580 (7802), pp. 252–256. Cited by: Appendix C.
  • G. Pang, C. Shen, L. Cao, and A. V. D. Hengel (2021) Deep learning for anomaly detection: a review. ACM Computing Surveys (CSUR) 54 (2), pp. 1–38. Cited by: §1.
  • D. Park, Y. Hoshi, and C. C. Kemp (2018) A multimodal anomaly detector for robot-assisted feeding using an lstm-based variational autoencoder. IEEE Robotics and Automation Letters 3 (3), pp. 1544–1551. Cited by: §2.
  • N. Pawlowski, M. C. Lee, M. Rajchl, S. McDonagh, E. Ferrante, K. Kamnitsas, S. Cooke, S. Stevenson, A. Khetani, T. Newman, et al. (2018) Unsupervised lesion detection in brain ct using bayesian convolutional autoencoders. Cited by: §2.
  • W. H. L. Pinaya, P. Tudosiu, R. Gray, G. Rees, P. Nachev, S. Ourselin, and M. J. Cardoso (2021) Unsupervised brain anomaly detection and segmentation with transformers. arXiv preprint arXiv:2102.11650. Cited by: §2.
  • E. Principi, F. Vesperini, S. Squartini, and F. Piazza (2017) Acoustic novelty detection with adversarial autoencoders. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 3324–3330. Cited by: §2.
  • G. Ratsch, S. Mika, B. Scholkopf, and K. Muller (2002) Constructing boosting algorithms from svms: an application to one-class classification. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (9), pp. 1184–1199. Cited by: §2.
  • L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, G. Montavon, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2021) A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE. Cited by: §1, §4.1.
  • L. Ruff, R. A. Vandermeulen, B. J. Franks, K. Müller, and M. Kloft (2020) Rethinking assumptions in deep anomaly detection. arXiv preprint arXiv:2006.00339. Cited by: §2.
  • L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In International conference on machine learning, pp. 4393–4402. Cited by: §2.
  • M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli (2018) Adversarially learned one-class classifier for novelty detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3379–3388. Cited by: §2.
  • B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural computation 13 (7), pp. 1443–1471. Cited by: §2.
  • Y. Singh and L. McGeoch (2016) Fetal anomaly screening for detection of congenital heart defects. J. Neonatal Biol. 5 (2), pp. 100–115. Cited by: §1.
  • D. M. Tax and R. P. Duin (2004) Support vector data description. Machine learning 54 (1), pp. 45–66. Cited by: §2.
  • D. Van Der Linde, E. E. Konings, M. A. Slager, M. Witsenburg, W. A. Helbing, J. J. Takkenberg, and J. W. Roos-Hesselink (2011) Birth prevalence of congenital heart disease worldwide: a systematic review and meta-analysis. Journal of the American College of Cardiology 58 (21), pp. 2241–2247. Cited by: §1.
  • C. Van Velzen, S. Clur, M. Rijlaarsdam, E. Pajkrt, C. Bax, J. Hruda, C. de Groot, N. Blom, and M. Haak (2016) Prenatal diagnosis of congenital heart defects: accuracy and discrepancies in a multicenter cohort. Ultrasound in Obstetrics & Gynecology 47 (5), pp. 616–622. Cited by: §1.
  • H. Vaseli, Z. Liao, A. H. Abdi, H. Girgis, D. Behnami, C. Luong, F. T. Dezaki, N. Dhungel, R. Rohling, K. Gin, et al. (2019) Designing lightweight deep learning models for echocardiography view classification. In Medical Imaging 2019: Image-Guided Procedures, Robotic Interventions, and Modeling, Vol. 10951, pp. 93–99. Cited by: §2.
  • D. Xu, E. Ricci, Y. Yan, J. Song, and N. Sebe (2015) Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553. Cited by: §2.
  • H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, et al. (2018) Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 world wide web conference, pp. 187–196. Cited by: §2, §3.2.1.
  • S. Yan, J. S. Smith, W. Lu, and B. Zhang (2018) Abnormal event detection from videos using a two-stream recurrent variational autoencoder. IEEE Transactions on Cognitive and Developmental Systems 12 (1), pp. 30–42. Cited by: §2.
  • S. You, K. C. Tezcan, X. Chen, and E. Konukoglu (2019) Unsupervised lesion detection via image restoration with a normative prior. In International Conference on Medical Imaging with Deep Learning, pp. 540–556. Cited by: §2.
  • G. Yu, S. Wang, Z. Cai, E. Zhu, C. Xu, J. Yin, and M. Kloft (2020) Cloze test helps: effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 583–591. Cited by: §2.
  • M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus (2010) Deconvolutional networks. In 2010 IEEE Computer Society Conference on computer vision and pattern recognition, pp. 2528–2535. Cited by: Appendix C.

Appendix A Variational Trajectory Model ELBO derivation

Recall that we define with prior , while leaving the other trajectory parameters deterministic. Note that this effectively means that we define uniform priors , and over their support, while having posteriors

where is the Dirac Delta spiking at and , and are the trajectory parameter outputs of the encoder with weights for and respectively.

Given input sample and latent , recall that VAEs aim to maximize the Evidence LOwer Bound (ELBO):

Here, corresponds to the input echocardiogram whereas .

Note that and are conditionally independent, i.e.

The KL divergence is additive for joint distributions of independent random variables, i.e. for

and , where and are independent, it holds that

We can thus rewrite the ELBO as

Since we assumed a uniform prior for and , their KL-Divergence terms become constant under the Dirac Delta distribution. We can thus ignore the respective terms in the ELBO during optimization as they do not change the result of the .

Additionally, since

we can rewrite the ELBO's reconstruction term as

Finally, this leads to the following reformulation of the ELBO objective:
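The steps of this derivation can be sketched as follows. The symbols are assumptions made for illustration and may differ from the notation of the main text: $x$ denotes the input echocardiogram, $b$ the stochastic latent component with a Gaussian posterior, $\tau$ collects the deterministic trajectory parameters, and $\tau_\phi(x)$ is the corresponding encoder output. The additive constant in the last line stands for the KL terms of the deterministic parameters, which the text treats as constant during optimization.

```latex
\begin{align}
\mathrm{ELBO}(x)
  &= \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]
     - \mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big),
     \qquad z = (b, \tau), \\
q(z \mid x)
  &= q(b \mid x)\, q(\tau \mid x),
     \qquad p(z) = p(b)\, p(\tau),
     \qquad q(\tau \mid x) = \delta\big(\tau - \tau_\phi(x)\big), \\
\mathrm{KL}\big(q(z \mid x)\,\|\,p(z)\big)
  &= \mathrm{KL}\big(q(b \mid x)\,\|\,p(b)\big)
     + \mathrm{KL}\big(q(\tau \mid x)\,\|\,p(\tau)\big), \\
\mathbb{E}_{q(\tau \mid x)}\big[g(\tau)\big]
  &= g\big(\tau_\phi(x)\big)
     \quad \text{for any function } g, \\
\mathrm{ELBO}(x)
  &= \mathbb{E}_{q(b \mid x)}\big[\log p\big(x \mid b, \tau_\phi(x)\big)\big]
     - \mathrm{KL}\big(q(b \mid x)\,\|\,p(b)\big) + \text{const}.
\end{align}
```

Under these assumptions, only the stochastic component $b$ contributes a KL regularizer, while the deterministic parameters enter the objective through the reconstruction term alone.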

Appendix B Cohort Examples

Feature Statistic
No. of Patients
No. of Patients with no CHD
No. of Patients with SSD
No. of Patients with PH
No. of Patients with RVDil
Age (Days) (Mean ± SD)
Time until birth (Days) (Mean ± SD)
Weight (Grams) (Mean ± SD)
Manufacturer (Ultrasound Machine / Transducer) GE Logic S8 / S4-10 at 6 MHz
Original Video Size (pixels × pixels)
Video length (frames) (Mean ± SD)
Video FPS fps
Table 3: Cohort Statistics

The dataset for this study consists of echos of newborns and infants up to one year of age collected between and at a single center by a single pediatric cardiologist. All examinations were performed with the GE Logic S8 ultrasound machine and contain 2D video sequences of at least standard echo views, i.e., apical 4-chamber view (4CV) and parasternal long-axis view (PLAX). Of the patients, suffer from, potentially multiple, CHDs, and are healthy. See Table 3 for more details.

In order to evaluate anomaly detection performance, a pediatric cardiologist labeled the dataset with three categories: Pulmonary Hypertension (PH), Right Ventricular Dilation (RVDil), and Severe Structural Defects (SSD). While PH and RVDil are well-defined pathologies, SSD was defined as a category of multiple rare but severe CHD pathologies, including Ebstein’s anomaly, anomalous left coronary artery origin from pulmonary artery (ALCAPA), atrio-ventricular discordance and ventricular-artery concordance (AVD-VAC), Shone complex, total anomalous pulmonary venous drainage (TAPVD), tetralogy of Fallot (ToF), and complete atrioventricular septal defect (cAVSD). We illustrate examples of healthy, SSD, PH, and RVDil echos in both 4CV and PLAX views in Figure 5.

The collected echocardiograms were preprocessed by resizing them to pixels. Additionally, histogram equalization was performed to increase the contrast of the frames, and pixel values were normalized to the range . For video inputs, we assume that any heart anomaly is visible for a certain period of the heart cycle. It thus suffices to have a model that reconstructs only a fixed number of video frames, as long as at least one heart cycle is present in the video. The collected videos are recorded with frames per second (FPS), and we assume that a heart beats at least times a minute. Therefore, we decided to subsample the video frequency to FPS and reconstruct videos with a fixed length of 25 frames, which is enough to capture at least one cycle in every video. Hence, the input for video models consists of 25 concatenated consecutive frames of the subsampled video. Having fixed-length inputs enables us to implement more efficient architectures.
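The preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' pipeline: all function names are hypothetical, the target range [0, 1] is an assumption, and the frame rates and minimum heart rate used in the usage note are placeholders rather than the values of the study. Resizing is omitted.

```python
import math
from typing import List, Sequence


def subsample_indices(n_frames: int, src_fps: float, dst_fps: float) -> List[int]:
    """Return indices of source frames that approximate a dst_fps stream."""
    if not 0 < dst_fps <= src_fps:
        raise ValueError("expected 0 < dst_fps <= src_fps")
    step = src_fps / dst_fps  # number of source frames per kept frame
    indices = []
    k = 0
    while math.floor(k * step) < n_frames:
        indices.append(math.floor(k * step))
        k += 1
    return indices


def fixed_length_clip(frames: Sequence, length: int, start: int = 0) -> list:
    """Cut a clip of exactly `length` consecutive frames, or fail loudly."""
    clip = list(frames[start:start + length])
    if len(clip) != length:
        raise ValueError("video too short for the requested clip")
    return clip


def covers_one_cycle(length: int, fps: float, min_bpm: float) -> bool:
    """True if `length` frames at `fps` span the slowest assumed heart cycle."""
    return length / fps >= 60.0 / min_bpm


def equalize_and_normalize(pixels: Sequence[int], levels: int = 256) -> List[float]:
    """Histogram-equalize integer gray values and map them to [0, 1] (assumed range)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, running = [], 0
    for h in hist:
        running += h
        cdf.append(running)
    cdf_min = next((c for c in cdf if c > 0), 0)
    denom = len(pixels) - cdf_min
    if denom == 0:  # constant (or empty) frame: nothing to stretch
        return [0.0 for _ in pixels]
    return [(cdf[p] - cdf_min) / denom for p in pixels]
```

For example, with placeholder values, `subsample_indices(10, 50.0, 25.0)` keeps every second frame, and `covers_one_cycle(25, 25.0, 60.0)` checks that a 25-frame clip at 25 FPS spans one full cycle of a heart beating at least 60 times a minute.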

As in most clinical applications, data scarcity can lead to overfitting. To prevent this, we apply data augmentation during training by transforming samples with random affine transformations, brightness adjustments, gamma corrections, blurring, and the addition of salt-and-pepper noise before performing the forward pass.
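A few of the intensity augmentations listed above can be sketched as follows, operating on a flattened frame with values in [0, 1]. The function names and all parameter ranges are hypothetical choices for illustration, not the settings used in the experiments; affine transformations and blurring are left out to keep the sketch short.

```python
import random
from typing import List, Sequence


def adjust_brightness(pixels: Sequence[float], delta: float) -> List[float]:
    """Shift intensities by `delta` and clip the result to [0, 1]."""
    return [min(1.0, max(0.0, p + delta)) for p in pixels]


def gamma_correct(pixels: Sequence[float], gamma: float) -> List[float]:
    """Apply p -> p ** gamma; values in [0, 1] stay in [0, 1]."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return [p ** gamma for p in pixels]


def salt_and_pepper(pixels: Sequence[float], prob: float,
                    rng: random.Random) -> List[float]:
    """With probability `prob`, replace a pixel by 0.0 (pepper) or 1.0 (salt)."""
    out = []
    for p in pixels:
        if rng.random() < prob:
            out.append(1.0 if rng.random() < 0.5 else 0.0)
        else:
            out.append(p)
    return out


def augment(pixels: Sequence[float], rng: random.Random) -> List[float]:
    """Chain the augmentations with randomly drawn (assumed) parameters."""
    out = adjust_brightness(pixels, rng.uniform(-0.1, 0.1))
    out = gamma_correct(out, rng.uniform(0.8, 1.25))
    return salt_and_pepper(out, 0.01, rng)
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible per seed, which helps when debugging training runs.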

Figure 5: Examples of each label of the cohort in 4CV and PLAX views.

Appendix C Architecture

Hyperparameter | AE/VAE | TAE/TVAE
Latent Dimension | 64 | 66/67 (b:64; f:1; :1; v:1)
Batch Size | 128 | 64
Steps | 5000 | 106500
Number of Frames | 1 | 25
Optimizer | Adam | Adam
Learning Rate | |
Reconstruction Loss | MSE | MSE
VAE | 1 | 1
Table 4: Hyperparameters chosen across our experiments.
Figure 6: Definitions of the encoder/decoder building blocks: (a) convolution block, (b) deconvolution block, (c) linear block, (d) residual block.
Figure 7: Architectures of encoder (a), decoder (b) and latent space components (c). Spatial latent space components are used to learn , and for AE/VAE or , and for TAE/TVAE. Temporal latent space components learn or for TAE/TVAE.

As described in Appendix B, video inputs consist of concatenated frames with timestamps . Hence, we can treat video frames like different channels of an image, and pass them to a residual (He et al., 2016) encoder backbone. Each frame is then individually decoded by passing , or to a deconvolution (Zeiler et al., 2010) based decoder. To train the VAE, we used identical encoder and decoder architectures, only changing the first layer to take a single grayscale channel instead of frames and adapting latent fully connected layers to match dimensions. We provide schematics for the building blocks of our architectures in Figure 6 and describe the encoder/decoder architecture of our experiments in Figure 7.

Table 4 contains the hyperparameters used in our experiments. We kept hyperparameters mostly the same for all models, except for the number of steps: in contrast to the frame-wise models, TAE and TVAE models required more steps to converge. We suspect this is because the dimensionality of the input is 25 times larger, and the model thus requires more parameter updates to converge to a suitable optimum that results in good reconstructions. Batch size was chosen according to GPU memory capacity. All models are pretrained on the EchoNet-Dynamic dataset (Ouyang et al., 2020) to speed up training convergence.

Appendix D Further Reconstruction Experiments

In addition to the reconstruction quality experiments provided in Section 4.1, we compared the performance of the variational models to deterministic ones (i.e., standard autoencoder and non-variational trajectory models). As seen in Table 5, the deterministic trajectory models perform similarly to the variational models and are even slightly better with respect to the structural similarity score. Even though it shares the architecture of the VAE and was trained for the same number of steps, the autoencoder did not produce very good reconstruction scores in this experiment. We suspect that this may be an artifact of overfitting due to the small training set.

We provide more reconstructions of TVAE-S in Figure 8.

Figure 8: More TVAE-S reconstructions.
Table 5: Reconstruction scores across all introduced models and compared to AE/VAE.

Appendix E Reconstruction error based anomaly detection

A common alternative to MAP-based anomaly detection is to detect anomalies purely based on the reconstruction error of the model. This means, for model , sample and data space , we would simply define . In order to quantify the performance of the non-variational dynamic trajectory model (TAE) and compare it to a standard autoencoder trained on single-frame reconstruction, we performed another ablation on AE, VAE, and the variants of TAE and TVAE. Results of this ablation are aggregated in Table 6.
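Reconstruction-error scoring and its evaluation can be illustrated with the following sketch. It assumes the mean squared error as the distance between a sample and its reconstruction, which is one common choice and not necessarily the exact score used here; `reconstruct` is a hypothetical stand-in for a trained model, and the metric functions are simple reference implementations of the area under the ROC curve and average precision.

```python
from typing import Callable, List, Sequence


def reconstruction_score(
    x: Sequence[float],
    reconstruct: Callable[[Sequence[float]], Sequence[float]],
) -> float:
    """Anomaly score: mean squared error between x and its reconstruction."""
    x_hat = reconstruct(x)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random anomaly (label 1) outscores a random healthy sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values at the rank of each anomaly.

    Tied scores are ordered arbitrarily by the sort; this sketch does not
    apply any tie correction.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive sample")
    return total / hits
```

In this setup, each echo would receive one score from `reconstruction_score`, and the two metrics would then be computed over the scores of the test cohort with anomalous samples labeled 1.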

Table 6: Area under the curve and average precision for experiments performed with anomaly score .

Appendix F More Decision Heatmaps

In addition to the heatmaps presented in Section 4.3, we provide a more extensive collection of TVAE-S decision heatmaps in Figure 9 and Figure 10.

Figure 9: More TVAE-S decision heatmaps for healthy echos.
Figure 10: More TVAE-S decision heatmaps for anomalous echos.

Appendix G Generated Videos

The introduced TVAE variations are generative models. As such, in addition to producing good reconstructions of existing samples, they allow us to sample from the learned distribution. To qualitatively validate generative performance, we provide random generations of the TVAE-S model in Figure 11 for both 4CV and PLAX views.

Figure 11: Random TVAE-S generations of samples in 4CV and PLAX views.