Multivariate Time Series Imputation with Variational Autoencoders

07/09/2019 · Vincent Fortuin et al. · ETH Zurich

Multivariate time series with missing values are common in many areas, for instance in healthcare and finance. To address this problem, modern data imputation approaches should (a) be tailored to sequential data, (b) deal with high dimensional and complex data distributions, and (c) be based on the probabilistic modeling paradigm for interpretability and confidence assessment. However, many current approaches fall short in at least one of these aspects. Drawing on advances in deep learning and scalable probabilistic modeling, we propose a new deep sequential variational autoencoder approach for dimensionality reduction and data imputation. Temporal dependencies are modeled with a Gaussian process prior and a Cauchy kernel to reflect multi-scale dynamics in the latent space. We furthermore use a structured variational inference distribution that improves the scalability of the approach. We demonstrate that our model exhibits superior imputation performance on benchmark tasks and challenging real-world medical data.


1 Introduction

Time series are often associated with missing values, for instance due to faulty measurement devices, partially observed states, or costly measurement procedures [15]. These missing values impair the usefulness and interpretability of the data, leading to the problem of data imputation: estimating those missing values from the observed ones [38].

Multivariate time series, consisting of multiple correlated univariate time series or channels, give rise to two distinct ways of imputing missing information: (1) by exploiting temporal correlations within each channel, and (2) by exploiting correlations across channels, for example by using lower-dimensional representations of the data. For instance in a medical setting, if the blood pressure of a patient is unobserved, it can be informative that the heart rate at the current time is higher than normal and that the blood pressure was also elevated an hour ago. An ideal imputation model for multivariate time series should therefore take both of these sources of information into account. Another desirable property of such models is to offer a probabilistic interpretation, allowing for uncertainty estimation.

Unfortunately, current imputation approaches fall short with respect to at least one of these desiderata. While there are many time-tested statistical methods for multivariate time series analysis (e.g., Gaussian processes [37]) that work well in the case of complete data, these methods are generally not applicable when features are missing. On the other hand, classical methods for time series imputation often do not take the potentially complex interactions between the different channels into account [28, 34]. Finally, recent work has explored the use of non-linear dimensionality reduction using variational autoencoders for i.i.d. data points with missing values [33, 1, 30], but this work has not considered temporal data and strategies for sharing statistical strength across time.

Following these considerations, it is promising to combine non-linear dimensionality reduction with an expressive time series model. This can be done by jointly learning a mapping from the data space (where features are missing) into a latent space (where all dimensions are fully determined). A statistical model of choice can then be applied in this latent space to model temporal dynamics. If the dynamics model and the mapping for dimensionality reduction are both differentiable, the approach can be trained end-to-end.

In this paper, we propose an architecture that uses deep variational autoencoders (VAEs) to map the missing data time series into a latent space without missingness, where we model the low-dimensional dynamics with a Gaussian process (GP). As we will discuss below, we hereby propose a prior model that efficiently operates at multiple time scales, taking into account that the multivariate time series may have different channels (e.g., heart rate, blood pressure, etc.) that change with different characteristic frequencies. Finally, our variational inference approach makes use of efficient structured variational approximations, where we fit another multivariate Gaussian process in order to approximate the intractable true posterior.

We make the following contributions:

  • A new model. We propose a VAE architecture for multivariate time series imputation with a GP prior in the latent space to capture temporal dynamics. We propose a Cauchy kernel to allow the time series to display dynamics at multiple scales in the reduced-dimensionality latent space.

  • Efficient inference. We use a structured variational approximation that models posterior correlations in the time domain. By construction, inference is efficient and the time complexity for sampling from the variational distribution, used for training, is linear in the number of time steps (as opposed to cubic when done naïvely).

  • Benchmarking on real-world data. We carry out extensive comparisons to classical imputation methods as well as state-of-the-art deep learning approaches, and perform experiments on data from two different domains. Our method outperforms the baselines in both cases.

We start by reviewing the related literature in Sec. 2, describe the general setting in Sec. 3.1 and introduce our model and inference scheme in Sec. 3.2 and Sec. 3.3, respectively. Experiments and conclusions are presented in Sec. 4 and 5.

2 Related work

Classical statistical approaches.

The problem of missing values has been a long-standing challenge in many time series applications, especially in the field of medicine [34]. The earliest approaches to deal with this problem often relied on heuristics, such as mean imputation or forward imputation. Despite their simplicity, these methods are still widely applied today due to their efficiency and interpretability [15]. Orthogonal to these ideas, methods along the lines of expectation-maximization (EM) have been proposed, but they often require additional modeling assumptions [4].

Bayesian methods.

When it comes to estimating likelihoods and uncertainties relating to the imputations, Bayesian methods, such as Gaussian processes (GPs) [35], have a clear advantage over non-Bayesian methods such as single imputation [28]. There has been much recent work in making these methods more expressive and incorporating prior knowledge from the domain (e.g., medical time series) [41, 9] or adapting them to work on discrete domains [10], but their wide-spread adoption is hindered by their limited scalability and the challenges in designing kernels that are robust to missing values.

Deep learning techniques.

Another avenue of research in this area uses deep learning techniques, such as variational autoencoders (VAEs) [33, 1, 30, 8, 32] or generative adversarial networks (GANs) [42, 25]. It should be noted that VAEs allow for tractable likelihoods, while GANs generally do not and have to rely on additional optimization processes to find latent representations of a given input [27]. Unfortunately, none of these models explicitly take the temporal dynamics of time series data into account. Conversely, there are deep probabilistic models for time series [e.g., 22, 23, 11], but those do not explicitly handle missing data. There are also some VAE-based imputation methods that are designed for a setting where the data is complete at training time and the missingness only occurs at test time [12, 13, 17]. We do not consider this setting in our work.

HI-VAE.

Our approach borrows some ideas from the HI-VAE [33]. This model deals with missing data by defining an ELBO whose reconstruction error term only sums over the observed part of the data. For inference, the incomplete data are filled with arbitrary values (e.g., zeros) before they are fed into the inference network, which induces an unavoidable bias. The main difference to our approach is that the HI-VAE was not formulated for sequential data and therefore does not exploit temporal information in the imputation task.

Deep learning for time series imputation.

While the mentioned deep learning approaches are very promising, most of them do not take the time series nature of the data directly into account, that is, they do not model the temporal dynamics of the data when dealing with missing values. To the best of our knowledge, the only deep learning model for missing value imputation that does account for the time series nature of the data is the GRUI-GAN [29], which we describe in Sec. 4. We will show that our approach outperforms this baseline on our considered data sets.

Other related work.

Our proposed model combines several ideas from the domains of Bayesian deep learning and classical probabilistic modeling; thus, removing elements from our model naturally relates to other approaches. For example, removing the latent GP for modeling dynamics as well as our proposed structured variational distribution results in the HI-VAE [33] described above. Furthermore, our idea of using a latent GP in the context of a deep generative model bears similarities to the GPPVAE [7], but note that the GPPVAE was not proposed to model time series data and does not take missing values into account. Lastly, the GP prior with the Cauchy kernel is reminiscent of Jähnichen et al. [18] and the structured variational distribution is similar to the one used by Bamler and Mandt [3] in the context of modeling word embeddings over time. Neither of these two works, however, considered amortized inference or VAEs.

3 Model

Figure 1 (left: architecture sketch; right: graphical model): Overview of our proposed model with a convolutional inference network, a deep feed-forward generative network, and a Gaussian process prior in the latent space. The CNN blocks share their parameters, as do the MLP blocks. The Gaussian process has mean function $m_z(\cdot)$ and kernel function $k_z(\cdot, \cdot)$.

We propose a novel architecture for missing value imputation, an overview of which is depicted in Figure 1. Our model can be seen as a way to perform amortized approximate inference on a latent Gaussian process model.

The main idea of our proposed approach is to embed the data into a latent space of reduced dimensionality, in which every dimension is fully determined, and then model the temporal dynamics in this latent space. Since many features in the data might be correlated, the latent representation captures these correlations and uses them to reconstruct the missing values. Moreover, the GP prior in the latent space encourages the model to embed the data into a representation in which the temporal dynamics are smoother and more easily explainable than in the original data space. Finally, the structured variational distribution of the inference network allows the model to integrate temporal information into the representations, such that the reconstructions of missing values can not only be informed by correlated observed features at the same time point, but also by the rest of the time series.

Specifically, we combine ideas from VAEs [21], GPs [35], Cauchy kernels [18], structured variational distributions with efficient inference [3], and a special ELBO for missing data [33] and synthesize these ideas into a general framework for missing data imputation on time series. In the following, we will outline the problem setting, describe the assumed generative model, and derive our proposed inference scheme.

3.1 Problem setting and notation

We assume a data set $\{x^{(n)}_{1:T}\}_{n=1}^{N}$ with $N$ data points $x_{1:T} = (x_1, \ldots, x_T)$, $x_t \in \mathbb{R}^d$. Let us assume that the data points were measured at $T$ consecutive time points $\tau_1 < \tau_2 < \cdots < \tau_T$. By convention, we usually set $\tau_1 = 0$. The data can thus be viewed as a time series of length $\tau_T$ in time.

We moreover assume that any number of these data features can be missing, that is, that their values can be unknown. We can now partition each data point into observed and unobserved features. The observed features of data point $x_t$ are $x_t^{o} := [x_{tj} \mid x_{tj} \text{ is observed}]$. Equivalently, the missing features are $x_t^{m} := [x_{tj} \mid x_{tj} \text{ is missing}]$ with $x_t^{o} \cup x_t^{m} \equiv x_t$.

We can now use this partitioning to define the problem of missing value imputation. Missing value imputation describes the problem of estimating the true values of the missing features $x^{m}_{1:T}$ given the observed features $x^{o}_{1:T}$. Many methods assume the different data points to be independent, in which case the inference problem reduces to $T$ separate problems of estimating $p(x_t^{m} \mid x_t^{o})$. In the time series setting, this independence assumption is not satisfied, which leads to the more complex estimation problem of $p(x^{m}_{1:T} \mid x^{o}_{1:T})$.

3.2 Generative model

In this subsection, we describe the details of our proposed approach: reducing the observed time series with missing data into a lower-dimensional representation without missingness, and modeling the dynamics in this representation using Gaussian processes. It may be tempting to skip the step of dimensionality reduction and instead directly model the incomplete data in the observed space using GPs. We argue that this is not practical for several reasons.

Gaussian processes are well suited for time series modeling [37] and offer many advantages, such as data-efficiency and calibrated posterior probabilities. However, they come at the cost of inverting the kernel matrix, which has a time complexity of $\mathcal{O}(T^3)$ in the number of time points. Moreover, designing a kernel function that accurately captures correlations in feature space and also in the temporal dimension is difficult.

This problem becomes even worse if certain observations are missing. One option is to fill the missing values with some numerical value (e.g., zero) to make the kernel computable. However, this arbitrary filling may make two data points with different missingness patterns look very dissimilar when in fact they are close to each other in the ground-truth space. Another alternative is to treat every channel of the multivariate time series separately and let the GP infer missing values, but this ignores valuable correlations across channels.

In this work, we overcome the problem of defining a suitable GP kernel in the data space with missing observations by instead applying the GP in the latent space of a variational autoencoder where the encoded feature representations are complete. That is, we assign a latent variable $z_t \in \mathbb{R}^k$ to every $x_t$, and model temporal correlations in this reduced representation using a GP, $z(\tau) \sim \mathcal{GP}(m_z(\cdot), k_z(\cdot, \cdot))$. This way, we decouple the step of filling in missing values and capturing instantaneous correlations between the different feature dimensions from modeling dynamical aspects. The graphical model is depicted in Figure 1.

A remaining practical difficulty that we encountered is that many multivariate time series display dynamics at multiple time scales. One of our main motivations is to model time series that arise in medical setups where doctors measure different patient variables and vital signs, such as heart rate, blood pressure, etc. When using conventional GP kernels (e.g., the RBF kernel, $k_\mathrm{RBF}(\tau, \tau') = \sigma^2 \exp\!\left(-\frac{(\tau - \tau')^2}{2\lambda^2}\right)$), one implicitly assumes a single time scale of relevance ($\lambda$). We found that this choice does not reflect the dynamics of medical data very well.

In order to model data that varies at multiple time scales, we consider a mixture of RBF kernels with different $\lambda$'s [35]. By defining a Gamma distribution over the inverse squared length scale, that is, $p(\lambda^{-2}) = \mathrm{Gamma}(\alpha, \beta)$, we can compute an infinite mixture of RBF kernels,

$$\int p(\lambda^{-2}) \exp\!\left(-\frac{(\tau - \tau')^2 \lambda^{-2}}{2}\right) d\lambda^{-2} \;\propto\; \left(1 + \frac{(\tau - \tau')^2}{2\beta}\right)^{-\alpha} .$$

This yields the so-called Rational Quadratic kernel [35]. For $\alpha = 1$ and $\beta = l^2 / 2$, it reduces to the Cauchy kernel

$$k_\mathrm{Cau}(\tau, \tau') = \sigma^2 \left(1 + \frac{(\tau - \tau')^2}{l^2}\right)^{-1}, \qquad (1)$$

which has previously been successfully used in the context of robust dynamic topic modeling, where similar multi-scale time dynamics occur [18]. We therefore choose this kernel for our Gaussian process prior.
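To make the kernel choice concrete, the following minimal sketch (our own code, not the paper's implementation; the function name, default hyperparameters, and the jitter value are our assumptions) evaluates the Cauchy kernel from (1) on a grid of time points, yielding the prior covariance matrix of one latent channel:

```python
import numpy as np

def cauchy_kernel(taus, sigma=1.0, length_scale=1.0, jitter=1e-6):
    """Cauchy kernel k(tau, tau') = sigma^2 / (1 + (tau - tau')^2 / l^2), cf. Eq. (1).

    taus: 1-D array of time points. Returns the T x T prior covariance matrix;
    a small jitter on the diagonal keeps Cholesky factorizations numerically stable.
    """
    diff = taus[:, None] - taus[None, :]
    K = sigma**2 / (1.0 + diff**2 / length_scale**2)
    return K + jitter * np.eye(len(taus))

K = cauchy_kernel(np.arange(10, dtype=float))  # prior covariance over 10 time steps
```

Compared to an RBF kernel with the same length scale, the polynomial tails of (1) keep distant time points weakly correlated, which is what lets a single kernel accommodate both fast and slow dynamics.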

Given the latent time series $z_{1:T}$, the observations are generated time-point-wise by

$$x_t \sim p_\theta\!\left(x_t \mid g_\theta(z_t)\right) \quad \forall\, t, \qquad (2)$$

where $g_\theta : \mathbb{R}^k \to \mathbb{R}^d$ is a potentially nonlinear function parameterized by the parameter vector $\theta$. Considering the scenario of a medical time series, $z_t$ can be thought of as the latent physiological state of the patient and $g_\theta$ would be the process of generating observable measurements (e.g., heart rate, blood pressure, etc.) from that physiological state. In our experiments, the function $g_\theta$ is implemented by a deep neural network.
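As an illustration of this generative process (again only a sketch under our own naming; the paper's decoder is a deep network whose layer sizes are given in the appendix, and the Gaussian-mean reading of the decoder output is our assumption), one can sample each latent channel from the Cauchy-kernel GP prior and push the latent time series through a small decoder:

```python
import numpy as np

rng = np.random.default_rng(0)
T, latent_dim, data_dim = 10, 4, 8
taus = np.arange(T, dtype=float)

# Prior covariance from the Cauchy kernel, Eq. (1), with sigma = l = 1.
diff = taus[:, None] - taus[None, :]
K = 1.0 / (1.0 + diff**2) + 1e-6 * np.eye(T)

# One independent GP per latent dimension: z has shape (T, latent_dim).
z = rng.multivariate_normal(np.zeros(T), K, size=latent_dim).T

# Toy stand-in for g_theta: a two-layer MLP with random weights.
W1, b1 = rng.normal(size=(latent_dim, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, data_dim)), np.zeros(data_dim)
x_mean = np.tanh(z @ W1 + b1) @ W2 + b2   # (T, data_dim), decoder output per time step
```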

3.3 Inference model

In order to learn the parameters of the deep generative model described above, and in order to efficiently infer its latent state, we are interested in the posterior distribution $p(z_{1:T} \mid x_{1:T})$. Since the exact posterior is intractable, we use variational inference [19, 6, 43]. Furthermore, to avoid inference over per-datapoint (local) variational parameters, we apply inference amortization [21]. To make our variational distribution more expressive and capture the temporal correlations of the data, we employ a structured variational distribution [40] with efficient inference that leads to an approximate posterior which is also a GP.

We approximate the true posterior with a multivariate Gaussian variational distribution

$$q\!\left(z^{j}_{1:T} \mid x_{1:T}\right) = \mathcal{N}\!\left(m_j,\, \Lambda_j^{-1}\right), \qquad (3)$$

where $j$ indexes the dimensions in the latent space. Our approximation implies that our variational posterior is able to reflect correlations in time, but breaks dependencies across the different dimensions in $z$-space (which is typical in VAE training [21, 36]).

We choose the variational family to be the family of multivariate Gaussian distributions in the time domain, where the precision matrix $\Lambda_j$ is parameterized in terms of a product of bidiagonal matrices,

$$\Lambda_j := B_j^{\top} B_j, \qquad \left\{B_j\right\}_{tt'} = \begin{cases} b^{j}_{tt'} & \text{if } t' \in \{t,\, t+1\} \\ 0 & \text{otherwise.} \end{cases} \qquad (4)$$

Above, the $b^{j}_{tt'}$'s are local variational parameters and $B_j$ is an upper triangular band matrix. Similar structured distributions were also employed by Blei and Lafferty [5] and Bamler and Mandt [2].

This parameterization automatically leads to $\Lambda_j$ being positive definite, symmetric, and tridiagonal. Samples from the variational distribution can thus be generated in linear time in $T$ [16, 31, 3], as opposed to the cubic time complexity for a full-rank matrix. Moreover, compared to a fully factorized variational approximation, the number of variational parameters is merely doubled. Note that while the precision matrix is sparse, the covariance matrix can still be dense, which allows the posterior to reflect long-range dependencies in time.
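The practical payoff of the bidiagonal parameterization in (4) is that drawing a sample from the variational posterior only requires solving one bidiagonal linear system per latent dimension: if $\varepsilon \sim \mathcal{N}(0, I)$, then $z = m + B^{-1}\varepsilon$ has precision $B^{\top} B$, and back-substitution through $B$ costs $\mathcal{O}(T)$. The sketch below (our own code and naming, using SciPy's banded solver, not the paper's implementation) illustrates this:

```python
import numpy as np
from scipy.linalg import solve_banded

def sample_structured_gaussian(mean, b_diag, b_offdiag, rng):
    """Sample from N(mean, Lambda^{-1}) with Lambda = B^T B and B upper bidiagonal.

    mean: (T,) variational mean; b_diag: (T,) main diagonal of B (must be nonzero
    so that Lambda is positive definite); b_offdiag: (T-1,) superdiagonal of B.
    Solving the bidiagonal system B z = eps takes O(T) time.
    """
    T = len(mean)
    eps = rng.standard_normal(T)
    # Banded storage expected by solve_banded: row 0 = superdiagonal, row 1 = diagonal.
    ab = np.zeros((2, T))
    ab[0, 1:] = b_offdiag
    ab[1, :] = b_diag
    return mean + solve_banded((0, 1), ab, eps)

rng = np.random.default_rng(0)
z_j = sample_structured_gaussian(np.zeros(50), np.ones(50), -0.5 * np.ones(49), rng)
```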

Instead of optimizing $m_j$ and $B_j$ separately for every data point, we amortize the inference through an inference network with parameters $\psi$ that computes the variational parameters as a function of the inputs $x_{1:T}$. In the following, we accordingly denote the variational distribution as $q_\psi(z_{1:T} \mid x_{1:T})$. Following standard VAE training, the parameters $\theta$ of the generative model and $\psi$ of the inference network can be jointly trained by optimizing the evidence lower bound (ELBO),

$$\log p_\theta\!\left(x^{o}_{1:T}\right) \;\geq\; \mathbb{E}_{q_\psi(z_{1:T} \mid x_{1:T})}\!\left[\log p_\theta\!\left(x^{o}_{1:T} \mid z_{1:T}\right)\right] - \mathrm{KL}\!\left(q_\psi(z_{1:T} \mid x_{1:T}) \,\|\, p(z_{1:T})\right). \qquad (5)$$

Following Nazabal et al. [33] (see Sec. 2), we evaluate the ELBO only on the observed features of the data, since the remaining features are unknown, and set these missing features to a fixed value (zero) during inference. Our training objective is thus the right-hand side of (5).
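The only change to the standard ELBO is thus that the reconstruction term is restricted to observed entries. A minimal sketch of that term (our own code; the Gaussian decoder with fixed variance and the boolean observation mask are assumptions for illustration) could look as follows:

```python
import numpy as np

def masked_gaussian_loglik(x, x_mean, obs_mask, sigma=1.0):
    """Reconstruction term of the ELBO in (5), summed only over observed features.

    x, x_mean: (T, d) arrays (data and decoder means); obs_mask: boolean (T, d),
    True where a feature was observed. Missing entries are typically filled with
    zeros before encoding, but they contribute nothing to the objective.
    """
    loglik = -0.5 * ((x - x_mean) ** 2 / sigma**2 + np.log(2.0 * np.pi * sigma**2))
    return float(np.sum(loglik * obs_mask))
```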

Neural network architectures.

We use a convolutional neural network (CNN) as the inference network and a fully connected multilayer perceptron (MLP) as the generative network. The inference network convolves over the time dimension of the input data and allows for sequences of variable lengths. It consists of a number of convolutional layers that integrate information from neighboring time steps into a joint representation using a fixed receptive field (see Figure 1). The CNN outputs a tensor of size $T \times 3k$, where $k$ is the dimensionality of the latent space. Every row corresponds to a time step and contains $3k$ parameters, which are used to predict the mean vector as well as the diagonal and off-diagonal elements that characterize $B_j$ at the given time step. More details about the network structure are given in the appendix (Sec. A).
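To illustrate how the $T \times 3k$ output is consumed (a sketch with our own names; the exponential transform used to keep the diagonal positive is our assumption, not stated in the text), one can split every row into the means, the main diagonal of $B_j$, and its superdiagonal:

```python
import numpy as np

def split_variational_params(cnn_out):
    """Split the (T, 3k) inference-network output into variational parameters.

    Returns per-time-step means, the main diagonal of the bidiagonal matrices B_j
    (made positive via exp, an assumption for positive definiteness), and their
    superdiagonal entries, each of shape (T, k).
    """
    T, three_k = cnn_out.shape
    k = three_k // 3
    means = cnn_out[:, :k]
    b_diag = np.exp(cnn_out[:, k:2 * k])
    b_offdiag = cnn_out[:, 2 * k:]
    return means, b_diag, b_offdiag

means, b_diag, b_offdiag = split_variational_params(
    np.random.default_rng(0).normal(size=(10, 12)))  # T = 10, k = 4
```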

4 Experiments

We performed experiments on the benchmark data set Healing MNIST [22], which combines the classical MNIST data set [24] with properties common to medical time series, the SPRITES data set [26], and on a real-world medical data set from the 2012 Physionet Challenge [39]. We compared our model against conventional single imputation methods [28], GP-based imputation [35], VAE-based methods that are not specifically designed to handle temporal data [21, 33], and modern state-of-the-art deep learning methods for temporal data imputation [29]. We found strong quantitative and qualitative evidence that our proposed model outperforms the baseline methods in terms of imputation quality on all three tasks. In the following, we are first going to give an overview of the baseline methods and then present our experimental findings. Details about the data sets and neural network architectures can be found in the appendix (Sec. A).

4.1 Baseline methods

Forward imputation and mean imputation.

Forward and mean imputation are so-called single imputation methods, which means that they do not attempt to fit a distribution over possible values for the missing features, but only predict one estimate [28]. Forward imputation always predicts the last observed value for any given feature, while mean imputation predicts the mean of all the observations of the feature in a given time series.
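Both baselines are straightforward to state in code; the sketch below (our own implementation, assuming a boolean mask with True for observed entries) makes the difference between them explicit:

```python
import numpy as np

def forward_impute(x, obs_mask):
    """Replace each missing entry with the last observed value in its channel.
    x: (T, d) data, obs_mask: boolean (T, d). Leading missing values stay NaN."""
    out = np.where(obs_mask, x, np.nan)
    for t in range(1, len(out)):
        missing = np.isnan(out[t])
        out[t, missing] = out[t - 1, missing]
    return out

def mean_impute(x, obs_mask):
    """Replace missing entries with the per-channel mean of the observed values."""
    x_nan = np.where(obs_mask, x, np.nan)
    channel_means = np.nanmean(x_nan, axis=0)  # warns if a channel is never observed
    return np.where(np.isnan(x_nan), channel_means, x_nan)
```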

Gaussian process in data space.

One option to deal with missingness in multivariate time series is to fit independent Gaussian processes to each channel. As discussed previously (Sec. 3.2), this ignores the correlation between channels. The missing values are then imputed by taking the mean of the respective posterior of the GP for that feature.
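A minimal version of this baseline (our own sketch using scikit-learn; the kernel choice and hyperparameters are illustrative, not the paper's exact settings) fits one GP per channel over time and plugs in the posterior mean at the missing time points:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def gp_impute_channel(values, obs_mask):
    """Impute one channel of a time series with the posterior mean of a univariate GP.

    values: (T,) channel values, obs_mask: boolean (T,), True where observed.
    Assumes at least one observed value in the channel.
    """
    taus = np.arange(len(values), dtype=float)[:, None]
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-2,
                                  normalize_y=True)
    gp.fit(taus[obs_mask], values[obs_mask])
    imputed = np.array(values, dtype=float)
    imputed[~obs_mask] = gp.predict(taus[~obs_mask])
    return imputed
```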

VAE and HI-VAE.

The VAE [21] and HI-VAE [33] are fit to the data using the same training procedure as the proposed GP-VAE model. The VAE uses a standard ELBO that is defined over all the features, while the HI-VAE uses the ELBO from (5), which is only evaluated on the observed part of the feature space. During inference, missing features are filled with constant values, such as zero.

GRUI-GAN.

The GRUI-GAN [29] uses a recurrent neural network (RNN), namely a gated recurrent unit (GRU). Once the network is trained, a time series is imputed by optimizing the latent vector in the input space of the generator, such that the generator’s output on the observed features is closest to the true values.

4.2 Healing MNIST

Time series with missing values play a crucial role in the medical field, but are often hard to obtain. Krishnan et al. [22] generated a data set called Healing MNIST, which is designed to reflect many properties that one also finds in real medical data. We benchmark our method on a variant of this data set, which consists of short sequences of moving MNIST digits [24] that rotate randomly between frames. The analogy to healthcare is that every frame may represent the collection of measurements that describe a patient’s health state, which contains many missing measurements at each moment in time. The temporal evolution represents the non-linear evolution of the patient’s health state. The image frames contain around 60 % missing pixels and the rotations between two consecutive frames are normally distributed.

The benefit of this data set is that we know the ground truth of the imputation task. We compare our model against a standard VAE (no latent GP and standard ELBO over all features), the HI-VAE [33], as well as mean imputation and forward imputation. The models were trained on time series of digits from the Healing MNIST training set (50,000 time series) and tested on digits from the Healing MNIST test set (10,000 time series). Negative log likelihoods on the ground truth values of the missing pixels and mean squared errors (MSE) are reported in Table 1, and qualitative results are shown in Figure 2. To assess the usefulness of the imputations for downstream tasks, we also trained a linear classifier on the imputed MNIST digits to predict the digit class and measured its performance in terms of area under the receiver operating characteristic curve (AUROC) (Tab. 1).

| Model | Healing MNIST NLL | Healing MNIST MSE | Healing MNIST AUROC | SPRITES NLL | SPRITES MSE |
|---|---|---|---|---|---|
| Mean imputation [28] | - | 0.178 ± 0.000 | 0.787 | - | 0.013 ± 0.000 |
| Forward imputation [28] | - | 0.174 ± 0.000 | 0.779 | - | 0.028 ± 0.000 |
| VAE [21] | 0.4798 ± 0.0016 | 0.152 ± 0.001 | 0.738 | -0.3702 ± 0.0140 | 0.034 ± 0.000 |
| HI-VAE [33] | 0.2896 ± 0.0010 | 0.088 ± 0.000 | 0.811 | -0.3309 ± 0.0126 | 0.035 ± 0.000 |
| GP-VAE (proposed) | 0.2606 ± 0.0008 | 0.078 ± 0.000 | 0.826 | -1.9595 ± 0.0011 | 0.002 ± 0.000 |

Table 1: Performance of the different models on the Healing MNIST test set and the SPRITES test set in terms of negative log likelihood (NLL) and mean squared error (MSE) (lower is better), as well as downstream classification performance (AUROC) (higher is better). The reported values are means and their respective standard errors over the test set. The proposed model outperforms all the baselines.

Figure 2: Reconstructions from Healing MNIST and SPRITES. The GP-VAE (proposed) is stable over time and yields the highest fidelity.

Table 2: Performance of the different models on the Physionet test set in terms of AUROC of a linear SVM trained on the imputed time series. We observe that the proposed model outperforms the competitors.

| Model | AUROC |
|---|---|
| Mean imputation [28] | 0.502 |
| Forward imputation [28] | 0.552 |
| GP [35] | 0.576 |
| VAE [21] | 0.588 |
| HI-VAE [33] | 0.595 |
| GRUI-GAN [29] | 0.595 |
| GP-VAE (proposed) | 0.617 |

Our approach outperforms the baselines in terms of likelihood and MSE. The reconstructions (Fig. 2) reveal the benefits of the GP-VAE approach: related approaches yield unstable reconstructions over time, while our approach offers more stable reconstructions, using temporal information from neighboring frames. Moreover, our model also yields the most useful imputations for downstream classification in terms of AUROC. The downstream classification performance correlates well with the test likelihood on the ground truth data, supporting the intuition that it is a good proxy measure in cases where the ground truth likelihood is not available.

4.3 SPRITES data

To assess our model’s performance on more complex data, we applied it to the SPRITES data set, which has previously been used with sequential autoencoders [26]. The data set consists of 9,000 sequences of animated characters with different clothes, hair styles, and skin colors, performing different actions. Each frame has a size of 64 × 64 pixels with three color channels, and each time series features 8 frames. We again introduced about 60 % missing pixels and compared the same methods as above. The results are reported in Table 1 and example reconstructions are shown in Figure 2. As in the previous experiment, our model outperforms the baselines in terms of likelihood and MSE and also yields the most convincing reconstructions. The HI-VAE seems to suffer from posterior collapse in this setting, which might be due to the large dimensionality of the input data.

4.4 Real medical time series data

We also applied our model to the data set from the 2012 Physionet Challenge [39]. The data set contains 12,000 patients who were monitored in the intensive care unit (ICU) for 48 hours each. At each hour, there is a measurement of 36 different variables (heart rate, blood pressure, etc.), any number of which might be missing. We again compare our model against the standard VAE and HI-VAE, as well as a GP fit feature-wise in the data space and the GRUI-GAN model [29], which reported state-of-the-art imputation performance.

The main challenge is the absence of ground truth data for the missing values. This cannot easily be circumvented by introducing additional missingness since (1) the mechanism by which measurements were omitted is not random, and (2) the data set is already very sparse with about 90 % of the features missing. To overcome this issue, Luo et al. [29] proposed a downstream task as a proxy for the imputation quality. They chose the task of mortality prediction, which was one of the main tasks of the Physionet Challenge on this data set, and measured the performance in terms of AUROC. In this paper, we adopt this measure.

For the sake of interpretability, we used a linear support vector machine (SVM) as the downstream classification model. This model tries to optimally separate the whole time series in the input space using a linear hyperplane. The choice of model follows the intuition that under a perfect imputation, similar patients should be located close to each other in the input space, while that is not necessarily the case when features are missing or when the imputation is poor. Note that it is unrealistic to expect high accuracies in this task, as the clean data are unlikely to be perfectly separable. As seen in Table 1, this proxy measure correlates well with the ground truth likelihood.

The performances of the different methods under this measure are reported in Table 2. Our model outperforms all baselines, including the GRUI-GAN, which provides strong evidence that our model is well suited for real-world medical time series imputation.

5 Conclusion

We presented a deep probabilistic model for multivariate time series imputation, where we combined ideas from variational autoencoders and Gaussian processes. The VAE maps the missing data from the input space into a latent space where every dimension is completely determined. The GP then models the temporal dynamics in this latent space. To flexibly model dynamics on different time scales, we use a Cauchy kernel for the latent GP prior. Moreover, we use structured variational inference to approximate the latent GP posterior, which reflects the temporal correlations of the data more accurately than a fully factorized approximation. At the same time, inference in our variational distribution is still efficient, as opposed to inference in the full GP posterior.

We empirically validated our proposed model on Healing MNIST benchmark data, SPRITES data, and real-world medical time series data from the 2012 Physionet Challenge. We observe that our model outperforms classical baselines as well as modern deep learning approaches on these tasks. This suggests that the model can successfully learn the temporal dynamics of real-world processes, even under high missingness rates.

In future work, it would be interesting to assess the applicability of the model to other data domains (e.g., natural videos) and explore a larger variety of kernel choices (e.g., learned kernels) for the latent GP. Moreover, it could be a fruitful avenue of research to choose more sophisticated neural network architectures for the inference model and the generative model. The inference network could for instance use an architecture that factorizes across features [30] or groups of features [1], in order to handle missing values even more flexibly. The generative network, on the other hand, could be extended with an autoregressive model [14], in order to improve the coherence of the output time series even further, and the tightness of the ELBO could be improved by importance weighting [32].

References

Appendix

Appendix A Implementation details

A.1 Healing MNIST

Hyperparameter Value
Number of CNN layers in inference network 1
Number of filters per CNN layer 256
Filter size (i.e., time window size) 10
Number of feedforward layers in inference network 2
Width of feedforward layers 256
Dimensionality of latent space 256
Length scale of Cauchy kernel 1.0
Number of feedforward layers in generative network 3
Width of feedforward layers 256
Activation function of all layers ReLU
Learning rate during training 0.0001
Optimizer Adam [20]

Number of training epochs 20
Train/val/test split of data set 50,000/10,000/10,000
Dimensionality of time points 784
Length of time series 10
Table S1: Hyperparameters used in the VAE, HI-VAE and GP-VAE models for the experiment on Healing MNIST. Some of the parameters are only relevant in a subset of the models.

A.2 SPRITES

Hyperparameter Value
Number of CNN layers in inference network 3
Number of filters per CNN layer 1
Filter size (i.e., time window size) 10
Number of feedforward layers in inference network 2
Width of feedforward layers 256
Dimensionality of latent space 256
Length scale of Cauchy kernel 1.0
Number of feedforward layers in generative network 3
Width of feedforward layers 256
Activation function of all layers ReLU
Learning rate during training 0.0005
Optimizer Adam [20]
Number of training epochs 30
Train/val/test split of data set 8,000/1,000/1,000
Dimensionality of time points 12288
Length of time series 8
Table S2: Hyperparameters used in the VAE, HI-VAE and GP-VAE models for the experiment on SPRITES. Some of the parameters are only relevant in a subset of the models.

A.3 Real medical time series data

Hyperparameter Value
Number of CNN layers in inference network 1
Number of filters per CNN layer 32
Filter size (i.e., time window size) 10
Number of feedforward layers in inference network 1
Width of feedforward layers 32
Dimensionality of latent space 32
Length scale of Cauchy kernel 1.0
Number of feedforward layers in generative network 2
Width of feedforward layers 32
Activation function of all layers ReLU
Learning rate during training 0.0005
Optimizer Adam [20]
Number of training epochs 20
Train/val/test split of data set 4,000/4,000/4,000
Dimensionality of time points 36
Length of time series 48
Table S3: Hyperparameters used in the VAE, HI-VAE and GP-VAE models for the experiment on medical time series from the Physionet data set. Some of the parameters are only relevant in a subset of the models.