Variational PSOM: Deep Probabilistic Clustering with Self-Organizing Maps

10/03/2019
by Laura Manduchi, et al.

Generating visualizations and interpretations from high-dimensional data is a common problem in many fields. Two key approaches for tackling this problem are clustering and representation learning. There are very performant deep clustering models on the one hand and interpretable representation learning techniques, often relying on latent topological structures such as self-organizing maps, on the other hand. However, current methods do not yet successfully combine these two approaches. We present a new deep architecture for probabilistic clustering, VarPSOM, and its extension to time series data, VarTPSOM. We show that they achieve superior clustering performance compared to current deep clustering methods on static MNIST/Fashion-MNIST data as well as medical time series, while inducing an interpretable representation. Moreover, on the medical time series, VarTPSOM successfully predicts future trajectories in the original data space.


1 Introduction

Information visualization techniques are essential in areas where humans have to make decisions based on large amounts of complex data. Their goal is to find an interpretable representation of the data that allows the integration of humans into the data exploration process. This encourages visual discoveries of relationships in the data and provides guidance to downstream tasks. In this way, a much higher degree of confidence in the findings of the exploration is attained [981847]. An interpretable representation of the data, in which the underlying factors are easily visualized, is particularly important in domains where the reason for obtaining a certain prediction is as valuable as the prediction itself. However, finding a meaningful and interpretable representation of complex data can be challenging.

Clustering is one of the most natural ways for retrieving interpretable information from raw data. Long-established methods such as k-means [macqueen1967] and Gaussian Mixture Models [Bishop:2006:PRM:1162264] represent the cornerstone of cluster analysis. Their applicability, however, is often constrained to simple data, and their performance is limited on high-dimensional, complex, real-world data sets, which do not exhibit a clustering-friendly structure.

Deep generative models have recently achieved tremendous success in representation learning. Some of the most commonly used and efficient approaches are Autoencoders (AEs), Variational Autoencoders (VAEs), and Generative Adversarial Networks (GANs) [2013arXiv1312.6114K, 2014arXiv1406.2661G]. The compressed latent representation generated by these models has been shown to ease the clustering process [DBLP:journals/corr/abs-1801-07648]. As a result, combining deep generative models for feature extraction with clustering yields a dramatic increase in clustering performance [DBLP:journals/corr/XieGF15]. Although very successful, most of these methods do not investigate the relationships among clusters, and the clustered feature points live in a high-dimensional latent space that cannot easily be visualized or interpreted by humans.

The Self-Organizing Map (SOM) [58325] is a clustering method that provides such an interpretable representation. It arranges the obtained centroids in a topologically meaningful order, inducing a flexible neighbourhood structure. If the chosen topological structure is a 2-dimensional grid, it facilitates visualization. Alas, its applicability is often constrained to simple data sets similar to other classical clustering methods.

To resolve the above issues, we propose a novel deep architecture, the Variational Probabilistic SOM (VarPSOM), that jointly trains a VAE and a SOM to achieve an interpretable discrete representation while exhibiting state-of-the-art clustering performance. Instead of hard assignment of data points to clusters, our model uses a centroid-based probability distribution. It minimizes its Kullback-Leibler divergence against an auxiliary target distribution, while enforcing a SOM-friendly space. To highlight the importance of an interpretable representation for different purposes, we extended this model to deal with temporal data, yielding VarTPSOM. We discuss related work in Section 2. Extensive evidence of the superior clustering performance of both models, on MNIST/Fashion-MNIST images as well as real-world medical time series, is presented in Section 4.

Our main contributions are:

  • A novel architecture for deep clustering, yielding an interpretable discrete representation through the use of a probabilistic self-organizing map.

  • An extension of this architecture to time series, improving clustering performance on this data type and enabling temporal predictions.

  • A thorough empirical assessment of our proposed models, showing superior performance on benchmark tasks and challenging medical time series from the intensive care unit.

2 Related Work

Self-Organizing Maps have been widely used as a means to visualize information from large amounts of data [7008682] and as a form of clustering in which the centroids are connected by a topological neighborhood structure [10.1007/978-3-540-48247-5_9]. Since their early inception, several variants have been proposed to enhance their performance and scope. The adaptive subspace SOM (ASSOM) [Kohonen1995TheAS], for example, combines PCA and SOMs to map data into a reduced feature space. [TOKUNAGA200982] combine SOMs with multi-layer perceptrons to obtain a modular network. [7280357] proposed the Deep SOM (DSOM), an architecture composed of multiple layers similar to deep neural networks. There exist several methods tailored to representation learning on time series, among them [franceschi2019unsupervised, fortuin2019deep, fortuin2019multivariate], which are however not based on SOMs. Extensions of the SOM optimized for temporal data include the Temporal Kohonen Map [Chappell:1993:TKM:154879.154890], its improved version, the Recurrent SOM [10.1007/978-3-540-45240-9_1], and the Recursive SOM [VOEGTLIN2002979]. While the SOM and its variants are particularly effective for data visualization [7280357], it has rarely been attempted to combine their merits in this respect with modern state-of-the-art clustering methods, which often use deep generative models in combination with probabilistic clustering.

In particular, recent works on clustering analysis have shown that combining clustering algorithms with the latent space of AEs greatly increases the clustering performance [DBLP:journals/corr/abs-1801-07648]. [DBLP:journals/corr/XieGF15] proposed DEC, a method that sequentially applies embedding learning using Stacked Autoencoders (SAE) and the Clustering Assignment Hardening method on the obtained representation. An improvement of this architecture, IDEC [ijcai2017-243], includes the decoder network of the SAE in the learning process, so that training is driven by both the clustering loss and the reconstruction loss. Similarly, DCN [DBLP:journals/corr/YangFSH16] combines a k-means clustering loss with the reconstruction loss of an SAE to obtain an end-to-end architecture that jointly trains representations and clustering. These models achieve state-of-the-art clustering performance, but they do not investigate the relationships among clusters. An exception is the work by [DBLP:journals/corr/abs-1803-05206], which presents an unsupervised method that learns latent embeddings and discovers a multi-facet clustering structure. While relationships among clusters are discovered, the method does not provide a latent space that can be easily interpreted and that eases the process of analytical reasoning.

To the best of our knowledge, only two models have combined deep generative models with a SOM structure in the latent space. The SOM-VAE model [DBLP:journals/corr/abs-1806-02199], inspired by the VQ-VAE architecture [DBLP:journals/corr/abs-1711-00937], uses an AE to embed the input data points into a latent space and then applies a SOM-based clustering loss on top of this latent representation. It features hard assignments of points to centroids, as well as a Markov model for temporal data, both of which yield inferior expressivity compared to our method. The Deep Embedded SOM (DESOM) [inproceedings] improved on this model by using a Gaussian neighborhood window with exponential radius decay and by learning the SOM structure in a continuous setting. Both methods feature a topologically interpretable neighborhood structure and yield promising results in visualizing state spaces. However, their clustering quality is limited by the absence of techniques used in state-of-the-art clustering methods such as IDEC or DCN.

3 Probabilistic clustering with Variational PSOM

Given a set of data samples $\{x_i\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$, the goal is to partition the data into a set of $K$ clusters, while retaining a topological structure over the cluster centroids.

The proposed architecture for static data is presented in Figure 1(a). The input $x_i$ is embedded into a latent representation $z_i$ using a VAE. This latent vector is then clustered using PSOM, a new SOM clustering strategy that extends the Clustering Assignment Hardening method [DBLP:journals/corr/XieGF15]. The VAE and PSOM are trained jointly to learn a latent representation that boosts the clustering performance. To prevent the network from outputting a trivial solution, the decoder network reconstructs the input from the latent embedding, encouraging the reconstruction to be as similar as possible to the original input. The resulting loss function is a linear combination of the clustering loss and the reconstruction loss. To deal with temporal data, we propose another model variant, VarTPSOM, which is depicted in Figure 1(b).

(a) VarPSOM architecture for clustering of static data. Data points are mapped to a continuous embedding using a VAE. The loss function is the sum of a SOM-based clustering loss and the ELBO.
(b) VarTPSOM architecture, composed of VarPSOM modules connected by LSTMs across the time axis, which predict the continuous embedding of the next time step. This architecture allows unrolling future trajectories in the latent space as well as in the original data space, by reconstructing the predicted embeddings using the VAE.
Figure 1: Model architectures of (a) VarPSOM and (b) VarTPSOM.

3.1 Background

A Self-Organizing Map is comprised of $K$ nodes connected to form a grid. The node $m_j$, at position $j$ of the grid, corresponds to a centroid vector $\mu_j$ in the input space. The centroids are tied by a neighborhood relation $N(j)$. Given a random initialization of the centroids, the SOM algorithm randomly selects an input $x_i$ and updates both its closest centroid and that centroid's neighbors, moving them closer to $x_i$. For a complete description of the SOM algorithm, we refer to the appendix (A).

The Clustering Assignment Hardening method has been recently introduced by the DEC model [DBLP:journals/corr/XieGF15] and was shown to perform well in the latent space of AEs [DBLP:journals/corr/abs-1801-07648]. Given an embedding function $z_i = f(x_i)$, it uses a Student's t-distribution as a kernel to measure the similarity $s_{ij}$ between an embedded data point $z_i$ and a centroid $\mu_j$:

$s_{ij} = \left(1 + \|z_i - \mu_j\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}} \Big/ \sum_{j'} \left(1 + \|z_i - \mu_{j'}\|^2 / \alpha\right)^{-\frac{\alpha+1}{2}}.$

It improves the cluster purity by enforcing the distribution $S = (s_{ij})$ to approach a target distribution $T = (t_{ij})$:

$t_{ij} = \frac{s_{ij}^{\kappa} / \sum_{i'} s_{i'j}}{\sum_{j'} \left( s_{ij'}^{\kappa} / \sum_{i'} s_{i'j'} \right)}.$

By taking the original distribution to the power of $\kappa$ and normalizing it, the target distribution puts more emphasis on data points that are assigned with high confidence. We follow [DBLP:journals/corr/XieGF15] in choosing $\kappa = 2$, which leads to larger gradient contributions of points close to cluster centers, as they show empirically. The resulting clustering loss is defined as:

$\mathcal{L}_{\mathrm{CAH}} = \mathrm{KL}(T \,\|\, S) = \sum_{i} \sum_{j} t_{ij} \log \frac{t_{ij}}{s_{ij}}. \quad (1)$
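For concreteness, the following is a minimal NumPy sketch of Clustering Assignment Hardening as described above: the Student's t soft assignments, the sharpened target distribution, and the KL loss of Eq. (1). The function names and the explicit kappa argument are ours, not taken from any reference implementation.

```python
import numpy as np

def student_t_similarity(z, centroids, alpha=1.0):
    """Soft assignment s_ij of embedding z_i to centroid mu_j via a Student's t kernel."""
    # z: (N, d) embedded data points, centroids: (K, d)
    dist_sq = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)   # (N, K)
    s = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return s / s.sum(axis=1, keepdims=True)

def target_distribution(s, kappa=2):
    """Sharpened target t_ij: raise s to the power kappa and renormalize."""
    weight = s ** kappa / s.sum(axis=0, keepdims=True)
    return weight / weight.sum(axis=1, keepdims=True)

def cah_loss(s, t, eps=1e-12):
    """Clustering Assignment Hardening loss KL(T || S) from Eq. (1)."""
    return float(np.sum(t * (np.log(t + eps) - np.log(s + eps))))
```

In the joint model this loss is evaluated on embeddings produced by the VAE encoder rather than on fixed features.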

3.2 Probabilistic SOM (PSOM) clustering

Our proposed clustering method, called PSOM, expands Clustering Assignment Hardening to include a SOM neighborhood structure over the centroids. We add an additional loss term to (1) to achieve an interpretable representation. This term maximizes the similarity between each data point and the neighbors of its closest centroids. For each embedded data point $z_i$ and each centroid $\mu_j$, it is defined as the negative sum, over all neighbors $e \in N(j)$ of $\mu_j$, of the probability $s_{ie}$ that $z_i$ is assigned to $e$, weighted by the similarity $s_{ij}$ between $z_i$ and the centroid $\mu_j$:

$\mathcal{L}_{\mathrm{SOM}} = -\sum_{i} \sum_{j} s_{ij} \sum_{e \in N(j)} s_{ie}.$

The complete PSOM clustering loss is then:

$\mathcal{L}_{\mathrm{PSOM}} = \mathcal{L}_{\mathrm{CAH}} + \beta \, \mathcal{L}_{\mathrm{SOM}}.$

We note that for $\beta = 0$ it becomes equivalent to Clustering Assignment Hardening.
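A corresponding sketch of the SOM neighbourhood term and the combined PSOM loss is given below. It follows the verbal definition above and reuses cah_loss from the previous sketch; the 4-neighbourhood helper and the weight name beta are our own assumptions.

```python
def grid_neighbors(rows, cols):
    """Indices of the 4-neighbourhood on a rows x cols SOM grid; centroid index = r * cols + c."""
    nbrs = []
    for r in range(rows):
        for c in range(cols):
            adj = [(r + dr) * cols + (c + dc)
                   for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                   if 0 <= r + dr < rows and 0 <= c + dc < cols]
            nbrs.append(adj)
    return nbrs

def som_loss(s, neighbors):
    """Negative probability mass assigned to the neighbours of each centroid j,
    weighted by the similarity s_ij of the point to centroid j itself."""
    loss = 0.0
    for j, nbrs_j in enumerate(neighbors):
        nbr_mass = s[:, nbrs_j].sum(axis=1)          # (N,) mass on the grid neighbours of j
        loss -= float((s[:, j] * nbr_mass).sum())
    return loss

def psom_loss(s, t, neighbors, beta=1.0):
    """Complete PSOM loss: Clustering Assignment Hardening plus the weighted SOM term.
    With beta = 0 this reduces to plain Clustering Assignment Hardening."""
    return cah_loss(s, t) + beta * som_loss(s, neighbors)
```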

3.3 VarPSOM: VAE for feature extraction

In our method, the nonlinear mapping between the input $x_i$ and the embedding $z_i$ is realized by a VAE. Instead of directly embedding the input into a latent vector, the VAE learns a probability distribution $q_\phi(z_i \mid x_i)$, parametrized as a multivariate normal distribution whose mean and variance are produced by the encoder network. Similarly, it also learns the probability distribution $p_\theta(x_i \mid z_i)$ of the reconstructed output given a sampled latent embedding. Both $q_\phi$ and $p_\theta$ are neural networks, respectively called encoder and decoder. The ELBO loss is:

$\mathcal{L}_{\mathrm{ELBO}} = -\sum_{i} \Big[ \mathbb{E}_{q_\phi(z_i \mid x_i)}\big[\log p_\theta(x_i \mid z_i)\big] - \mathrm{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i)\big) \Big], \quad (2)$

where $p(z_i)$ is an isotropic Gaussian prior over the latent embeddings. The second term can be interpreted as a form of regularization, which encourages the latent space to be compact. For each data point the latent embedding $z_i$ is sampled from $q_\phi(z_i \mid x_i)$. Adding the ELBO loss to the PSOM loss from the previous subsection, we obtain the overall loss function of VarPSOM:

$\mathcal{L}_{\mathrm{VarPSOM}} = \mathcal{L}_{\mathrm{PSOM}} + \gamma \, \mathcal{L}_{\mathrm{ELBO}}. \quad (3)$

To the best of our knowledge, no previous SOM methods attempted to use a VAE to embed the inputs into a latent space. There are many advantages of a VAE over an AE for realizing our goals. Most importantly, learning a probability distribution over the embedding space improves interpretability of the model. For example, points with a higher variance in the latent space could be identified as potential outliers and therefore treated as less accurate and trustworthy. Moreover, the regularization term of the VAE prevents the network from scattering the embedded points discontinuously in the latent space, which naturally facilitates the fitting of the SOM. To test if the use of CNNs can boost clustering performance on image data, we introduce another model variant called VarCPSOM, which uses convolutional filters as part of the VAE.
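To make the overall objective concrete, the following sketch evaluates the negative ELBO of Eq. (2), assuming a Bernoulli decoder (e.g. for binarised MNIST pixels), and combines it with the PSOM loss from the previous sketch as in Eq. (3). The weights beta and gamma are illustrative placeholders, not the values used in our experiments.

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), one value per data point."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=1)

def bernoulli_log_likelihood(x, x_hat, eps=1e-8):
    """log p(x | z) for a Bernoulli decoder (e.g. binarised MNIST pixels)."""
    return np.sum(x * np.log(x_hat + eps) + (1 - x) * np.log(1 - x_hat + eps), axis=1)

def elbo_loss(x, x_hat, mu, logvar):
    """Negative ELBO of Eq. (2): reconstruction term plus KL regulariser, averaged over points."""
    return float(np.mean(-bernoulli_log_likelihood(x, x_hat) + gaussian_kl(mu, logvar)))

def varpsom_loss(x, x_hat, mu, logvar, s, t, neighbors, beta=1.0, gamma=1.0):
    """Overall VarPSOM objective in the spirit of Eq. (3): PSOM loss plus weighted ELBO.
    beta and gamma are illustrative weights, not the values used in the experiments."""
    return psom_loss(s, t, neighbors, beta) + gamma * elbo_loss(x, x_hat, mu, logvar)
```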

3.4 VarTPSOM: Extension to time series data

To extend our proposed model to time series data, we add a temporal component to the architecture. Given a set of $N$ time series of length $T$, $\{x_{i,t}\}_{t=1,\dots,T}$, the goal is to learn interpretable trajectories on the SOM grid. To do so, VarPSOM could be used directly, but it would treat each time step of a time series independently, which is undesirable. To exploit temporal information and enforce smoothness in the trajectories, we add an additional loss term to (3):

$\mathcal{L}_{\mathrm{smooth}} = -\sum_{i=1}^{N} \sum_{t=1}^{T-1} u_{i,t}, \quad (4)$

where $u_{i,t}$ is the similarity between $z_{i,t}$ and $z_{i,t+1}$ under a Student's t-distribution, and $z_{i,t}$ refers to the embedding of time series $i$ at time index $t$. This term maximizes the similarity between latent embeddings of adjacent time steps, such that large jumps in the latent state between time points are discouraged.

One of the main goals in time series modeling is to predict future data points or, alternatively, future embeddings. This can be achieved by adding a long short-term memory network (LSTM) across the latent embeddings of the time series, as shown in Figure 1(b). Each cell of the LSTM takes as input the latent embedding $z_{i,t}$ at time step $t$ and predicts a probability distribution over the next latent embedding, $p_\omega(z_{i,t+1} \mid z_{i,t})$. We parametrize this distribution as a multivariate normal distribution whose mean and variance are learnt by the LSTM. The prediction loss is the negative log-likelihood of a sample of the next embedding $z_{i,t+1}$ under the learned distribution:

$\mathcal{L}_{\mathrm{pred}} = -\sum_{i=1}^{N} \sum_{t=1}^{T-1} \log p_\omega(z_{i,t+1} \mid z_{i,t}). \quad (5)$

The final loss of VarTPSOM, which is trainable in a fully end-to-end fashion, is

$\mathcal{L}_{\mathrm{VarTPSOM}} = \mathcal{L}_{\mathrm{VarPSOM}} + \mathcal{L}_{\mathrm{smooth}} + \mathcal{L}_{\mathrm{pred}}. \quad (6)$
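The two temporal terms can be sketched as follows; the diagonal-Gaussian parametrization of the LSTM output (pred_mu, pred_logvar) and the equal weighting of the three terms reflect our reading of Eqs. (4)-(6) rather than a reference implementation.

```python
import numpy as np

def smoothness_loss(z_seq, alpha=1.0):
    """Eq. (4): negative Student's t similarity between embeddings of adjacent time steps.
    z_seq: (batch, T, d) latent embeddings of each time series."""
    diff_sq = ((z_seq[:, 1:, :] - z_seq[:, :-1, :]) ** 2).sum(-1)     # (batch, T-1)
    sim = (1.0 + diff_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return float(-np.mean(sim))

def prediction_loss(z_next, pred_mu, pred_logvar):
    """Eq. (5): negative Gaussian log-likelihood of the sampled next embedding under the
    distribution predicted by the LSTM (mean pred_mu, log-variance pred_logvar)."""
    var = np.exp(pred_logvar)
    log_prob = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (z_next - pred_mu) ** 2 / var, axis=-1)
    return float(-np.mean(log_prob))

def vartpsom_loss(varpsom_term, z_seq, z_next, pred_mu, pred_logvar):
    """Eq. (6): VarPSOM loss plus the smoothness and prediction terms (equal weights assumed)."""
    return varpsom_term + smoothness_loss(z_seq) + prediction_loss(z_next, pred_mu, pred_logvar)
```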

4 Experiments

First, we evaluate VarPSOM and VarCPSOM and compare them with state-of-the-art non-interpretable as well as SOM-based clustering methods on MNIST [726791] and Fashion-MNIST [DBLP:journals/corr/abs-1708-07747] data. Here, particular focus is laid on the comparison of VarPSOM and the clustering models DEC and IDEC, to investigate the role of the VAE and the SOM loss. We then present visualizations of the obtained 2D representations, to illustrate how our method could ease visual reasoning about the data. Finally, we present extensive evidence of the performance of VarTPSOM on real-world complex time series from the eICU data set [pollard2018eicu], and illustrate how it allows visualization of patient health state trajectories in an easily understandable 2D domain. For details on the data sets, we refer to the appendix (B.1). Code to train our models and reproduce the results is available at https://github.com/ratschlab/variational-psom.

Baselines

We used two different types of baselines. The first category contains clustering methods that do not provide any interpretable discrete latent representation. These include k-means, the DEC model, as well as its improved version IDEC, whose clustering methods are related to ours. We also include a modified version of IDEC, which we call VarIDEC, in which we substitute the AE with a VAE, to investigate the role of the VAE in our method. For all these methods we use 64 clusters. In the second category, we include state-of-the-art clustering methods based on SOMs. Here, we used a standard SOM (minisom); AE+SOM, an architecture composed of an AE and a SOM applied on top of the latent representation (trained sequentially); SOM-VAE; and DESOM. For all SOM-based methods we set the SOM grid size to 8 × 8.

Implementation

In implementing our models we focused on retaining a fair comparison with the baselines. Hence we decided to use a standard network structure, with fully connected layers, to implement both the VAE of our models and the AE of the baselines. The latent dimension is set to a larger value for the VAE than for the AEs: since the prior in the VAE enforces the latent embeddings to be compact, the VAE also requires more dimensions to learn a meaningful latent space. On the other hand, providing the AE models with the higher-dimensional latent space needed for the VAE resulted in a dramatic decrease in performance (see appendix B.2). VarCPSOM is composed of convolutional layers, with the same kernel size used for all layers. For all architectures, no greedy layer-wise pretraining was used to tune the VAE. Instead, we simply run the VAE without the clustering loss for a few epochs for initialization. A standard SOM is then used to produce an initial configuration of the centroids and the neighbourhood relation. Finally, the entire architecture is trained jointly. To avoid fine-tuning hyperparameters, given the unsupervised setting, the loss weights are kept fixed across experiments and chosen so that the different loss components have the same order of magnitude.
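As an illustration of the kind of architecture described above, the following is a hedged TensorFlow 2 sketch of a fully connected VAE encoder/decoder with a reparameterized sampling step; the hidden layer sizes and the latent dimension are placeholders, not the values used in our experiments.

```python
import tensorflow as tf

def build_vae(input_dim=784, latent_dim=32, hidden=(500, 500, 2000)):
    """Fully connected VAE encoder/decoder; layer sizes and latent_dim are illustrative."""
    x_in = tf.keras.Input(shape=(input_dim,))
    h = x_in
    for units in hidden:
        h = tf.keras.layers.Dense(units, activation="relu")(h)
    z_mu = tf.keras.layers.Dense(latent_dim)(h)          # mean of q(z | x)
    z_logvar = tf.keras.layers.Dense(latent_dim)(h)      # log-variance of q(z | x)
    encoder = tf.keras.Model(x_in, [z_mu, z_logvar], name="encoder")

    z_in = tf.keras.Input(shape=(latent_dim,))
    g = z_in
    for units in reversed(hidden):
        g = tf.keras.layers.Dense(units, activation="relu")(g)
    x_out = tf.keras.layers.Dense(input_dim, activation="sigmoid")(g)   # reconstruction
    decoder = tf.keras.Model(z_in, x_out, name="decoder")
    return encoder, decoder

def reparameterize(z_mu, z_logvar):
    """Sample z ~ q(z | x) with the reparameterization trick."""
    eps = tf.random.normal(tf.shape(z_mu))
    return z_mu + tf.exp(0.5 * z_logvar) * eps
```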

Clustering Evaluation

Table 1 shows the clustering quality of VarPSOM and VarCPSOM on MNIST and Fashion-MNIST data, compared with the baselines. Purity and Normalized Mutual Information (NMI) are used as evaluation metrics. We observe that our proposed models outperform the baselines of both categories and achieve state-of-the-art clustering performance.
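Both metrics can be computed as follows; this is a standard evaluation sketch (using scikit-learn for NMI), not the evaluation code used for the reported numbers.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(y_true, y_pred):
    """Cluster purity: each cluster is credited with its most frequent ground-truth label.
    Expects integer NumPy arrays of true labels and cluster assignments."""
    correct = 0
    for cluster in np.unique(y_pred):
        labels_in_cluster = y_true[y_pred == cluster]
        correct += np.bincount(labels_in_cluster).max()
    return correct / len(y_true)

def evaluate_clustering(y_true, y_pred):
    """Returns (purity, NMI) for the given assignments."""
    return purity(y_true, y_pred), normalized_mutual_info_score(y_true, y_pred)
```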

MNIST fMNIST
Kmeans
DEC
IDEC - -
VarIDEC (ours)
SOM
AE+SOM
SOM-VAE
DESOM
VarPSOM (ours)
VarCPSOM (ours)
Table 1: Clustering performance of VarPSOM using 64 clusters arranged in an 8 × 8 SOM map, compared with baselines. The methods are grouped into approaches with no topological structure in the discrete latent space, interpretable methods using a SOM-based structure in the latent space, and an extension of our method using convolutional filters. Means and standard errors across 10 runs with different random model initializations are displayed.

VarPSOM vs. IDEC

VarIDEC shows superior clustering performance compared to DEC and IDEC (Table 1). We conclude that the VAE indeed succeeds in capturing a more meaningful latent representation compared to a standard AE. Regarding the second difference, the SOM structure was expected to slightly decrease the clustering performance, due to a trade-off between interpretability and raw clustering performance. However, we do not observe this in our results. Adding the SOM loss rather leads to an increase of the clustering performance. We suspect this is due to the regularization effect of the SOM’s topological structure. Overall, VarPSOM outperforms both DEC and IDEC.

Improvement over Training

After obtaining the initial configuration of the SOM structure, both clustering and feature extraction using the VAE are trained jointly. To illustrate that our architecture improves clustering performance over the initial configuration, we plotted NMI and Purity against the number of training iterations in Figure 2. We observe that the performance is stable when increasing the number of epochs and no overfitting is visible.

Figure 2: NMI (left) and Purity (right) performance of VarPSOM over the number of training epochs on the MNIST test set.

Role of the SOM loss

To investigate the influence of the SOM loss component, we plot the clustering performance of VarPSOM against the weight $\beta$ of $\mathcal{L}_{\mathrm{SOM}}$ in Fig. 3, using the MNIST dataset. At the default setting of $\beta$, the $\mathcal{L}_{\mathrm{CAH}}$ term (responsible for improving clustering purity) and the weighted $\mathcal{L}_{\mathrm{SOM}}$ term (responsible for enforcing a SOM structure over the centroids) are almost equal. It is interesting to observe the different trends in NMI and purity: NMI increases for increasing values of $\beta$, while purity slightly decreases. Overall, enforcing a more interpretable latent space results in a more robust clustering model with higher NMI.

Figure 3: NMI (left) and Purity (right) performance of VarPSOM, with standard error, over different values of $\beta$ on the MNIST test set.

Time Series Evaluation

We evaluate the clustering performance of our proposed models on the eICU dataset, comprised of complex medical time series. We compare them against SOM-VAE, as this is the only method among the baselines that is suited for temporal data. Table 2 shows the cluster cell enrichment in terms of NMI for three different labels, the current (APACHE-0) and worst future (APACHE-6/12 hours) physiology scores. VarTPSOM clearly achieves superior clustering performance compared to SOM-VAE. This, we hypothesize, is due to the better feature extraction using a VAE as well as the improved treatment of uncertainty using PSOM, which features soft assignments, whereas SOM-VAE contains a deterministic AE and hard assignments. Moreover, both the smoothness loss and the prediction loss seem to increase the clustering performance. More results on ICU time series are reported in the appendix (B.3).

To quantify the performance of VarTPSOM in unrolling future trajectories, we predict the final latent embeddings of each time series. For each predicted embedding we reconstruct the input using the decoder of the VAE. Finally, we measure the MSE between the original and the reconstructed inputs for the last 6 hours of the ICU admission. As one baseline, we used an LSTM that takes as input the earlier part of the time series and then predicts the remaining hours. Since most trajectories tend to stay in the same state over long periods of time, another strong baseline is obtained by duplicating the last seen embedding over the final 6 hours. The results (Table 3) indicate that the joint training of clustering and prediction used by VarTPSOM clearly outperforms both baselines.

Model APACHE-12 APACHE-6 APACHE-0
SOM-VAE
VarPSOM
VarTPSOM ()
VarTPSOM
Table 2: Mean NMI and standard error of cluster enrichment vs. current/future APACHE physiology scores, using a 2D (8 × 8) SOM map, across 10 runs with different random model initializations.
Model LSTM SameState VarTPSOM
MSE
Table 3: MSE for predicting the time series of the last 6 hours before ICU dispatch, given the prior time series.

Interpretability

Figure 4: Reconstructions of (a) MNIST and (b) Fashion-MNIST data from SOM cells in the 8 × 8 grid learned by VarPSOM, illustrating the topological neighbourhood structure induced by our method, which aids interpretability.

To illustrate the topological structure of the latent space, we present reconstructions of the VarPSOM centroids, arranged in a grid, on static MNIST/Fashion-MNIST data in Figure 4. On the ICU time series data, we show example trajectories for one patient who died at the end of the ICU stay, as well as two control patients who were healthily dispatched from the ICU. We observe that the trajectories are located in different parts of the SOM grid and form a smooth and interpretable representation (Fig. 5). For further results, including a more quantitative evaluation using randomly sampled trajectories, enrichment for future mortality, and an illustration of how the uncertainty generated by the soft assignments can help in data visualization, we refer to the appendix (B.4).

(a) Patient dispatched expired
(b) Patient dispatched healthy 1
(c) Patient dispatched healthy 2
Figure 5: Illustration of three example patient trajectories, between the beginning of the time series and ICU dispatch, on the 2D SOM grid of VarTPSOM. The heatmap shows the enrichment of cells for the current APACHE physiology score. We observe qualitative differences between the trajectories of the dying and the healthy patients.

5 Conclusion

We presented two novel methods for interpretable unsupervised clustering, VarPSOM and VarTPSOM. Both models make use of a VAE and a novel clustering method, PSOM, which extends the classical SOM algorithm to include a centroid-based probability distribution. Our models achieve superior clustering performance compared to state-of-the-art deep clustering baselines on benchmark data sets and real-world medical time series. The use of a VAE for feature extraction, instead of the AE used in previous methods, together with soft assignments of data points to clusters, results in an interpretable model that can quantify uncertainty in the data.

Acknowledgments

This project was supported by the grant #2017‐110 of the Strategic Focus Area "Personalized Health and Related Technologies (PHRT)" of the ETH Domain. VF, MH are partially supported by ETH core funding (to GR). MH is supported by the Grant No. 205321_176005 of the Swiss National Science Foundation (to GR). We thank Natalia Marciniak for her administrative efforts.

References

Appendix

Appendix A Self-Organizing Maps

Among various existing interpretable unsupervised learning algorithms, Kohonen's self-organizing map (SOM) [58325] is one of the most popular models. It is comprised of neurons connected to form a discrete topological structure. The data are projected onto this topographic map, which locally approximates the data manifold. Usually it is a finite two-dimensional region in which the neurons are arranged in a regular hexagonal or rectangular grid. Here we use a rectangular grid because of its simplicity and its visualization properties. Each neuron $m_j$, at position $j$ of the grid, for $j = 1, \dots, K$, corresponds to a centroid vector $\mu_j$ in the input space. The centroids are tied by a neighborhood relation, here defined as $N(j)$.

Given a random initialization of the centroids, the SOM algorithm randomly selects an input $x_i$ and updates both its closest centroid and that centroid's neighbors, moving them closer to $x_i$. Algorithm 1 iterates these steps until convergence.

Algorithm 1: Self-Organizing Map
Require: randomly initialized centroids $\{\mu_j\}_{j=1}^{K}$
repeat
   At each time $t$, present an input $x(t)$ and select the winner, $j^\ast(t) = \arg\min_j \|x(t) - \mu_j(t)\|$
   Update the weights of the winner and its neighbours, $\mu_j(t+1) = \mu_j(t) + \alpha(t)\, h_{j j^\ast(t)}(t)\, \big(x(t) - \mu_j(t)\big)$
until the map converges
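A compact NumPy sketch of these steps is given below; the linear learning-rate and radius schedules are illustrative choices rather than those of any particular SOM implementation.

```python
import numpy as np

def som_train(X, rows=8, cols=8, n_iter=10000, lr0=0.5, sigma0=2.0, seed=0):
    """Classical SOM: repeatedly pick a random input, find the winning centroid,
    and pull the winner and its grid neighbours towards the input."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    centroids = X[rng.choice(n, size=rows * cols, replace=False)].astype(float)
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)], dtype=float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                      # decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3         # decaying neighbourhood radius
        x = X[rng.integers(n)]
        winner = np.argmin(((centroids - x) ** 2).sum(axis=1))     # best matching unit
        grid_dist_sq = ((grid - grid[winner]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist_sq / (2.0 * sigma ** 2))              # neighbourhood function
        centroids += lr * h[:, None] * (x - centroids)
    return centroids
```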

The range of SOM applications includes high dimensional data visualization, clustering, image and video processing, density or spectrum profile modeling, text/document mining, management systems and gene expression data analysis.

Appendix B Experimental and implementation details

B.1 Datasets

  • MNIST: It consists of 70000 handwritten digits of 28-by-28 pixel size. Digits range from 0 to 9, yielding 10 patterns in total. The digits have been size-normalized and centered in a fixed-size image [726791].

  • Fashion MNIST: A dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples [DBLP:journals/corr/abs-1708-07747]. Each example is a 28×28 grayscale image, associated with a label from 10 classes.

  • eICU: For temporal data we use vital sign/lab measurements of intensive care unit (ICU) patients, resampled to a 1-hour grid using forward filling, and filling with population statistics from the training set if no measurements were available (a sketch of this resampling step is shown below). We excluded ICU stays which were shorter than 1 day, longer than 30 days, or which had at least one gap in the continuous vital sign monitoring, which we define as an interval between 2 HR measurements of at least 1 hour. This yielded ICU stays from the eICU database. Vital sign variables and lab measurement variables were included, giving an overall data dimension of . The last 72 hours of these multivariate time series were used for the experiments. As labels we use a variant of the current dynamic APACHE physiology score (APACHE-0), the worst APACHE score in the next 6 and 12 hours (APACHE-6/12), and the mortality in the next 24 hours. Only those variables from the APACHE score definition which are recorded in the eICU database were taken into account.

Each dataset is divided into training, validation and test sets for both our models and the baselines.
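As an illustration of the eICU resampling step described above, the following is a minimal pandas sketch; the helper name and the assumption that each stay is a timestamp-indexed DataFrame with one column per variable are ours.

```python
import pandas as pd

def to_hourly_grid(stay_df, train_means):
    """Resample one ICU stay to a 1-hour grid: average multiple measurements per hour,
    forward-fill observed values, and fall back to training-set population means where
    nothing has been observed yet. `stay_df` is indexed by measurement timestamp,
    with one column per vital sign / lab variable."""
    hourly = stay_df.resample("1H").mean()   # 1-hour grid
    hourly = hourly.ffill()                  # carry the last observation forward
    hourly = hourly.fillna(train_means)      # population statistics from the training set
    return hourly
```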

B.2 Latent space dimension

We evaluated the DEC model for different latent space dimensions. Table S1 shows that the AE, used in the DEC model, performs better when a lower dimensional latent space is used.

Latent dimension Purity NMI
Table S1: Mean/Standard error of NMI and purity of DEC model on MNIST test set, across 10 runs with different random model initializations. We use 64 clusters and different latent space dimensions.

B.3 Learning health state representations in the ICU

By enforcing a SOM structure, VarPSOM, as well as VarTPSOM, projects the cluster centroids onto a discrete 2D grid. Such a grid is particularly suited for visualization purposes, and relations between centroids become immediately intuitive. In Fig. S1, heat-maps (colored according to enrichment in the current APACHE score, as well as mortality risk in the next 24 hours) show compact enrichment structures. VarTPSOM succeeds in creating a meaningful and smooth neighbourhood structure. It distinguishes risk profiles with practically zero mortality risk from those with high mortality risk, reaching up to 15%, in different regions of the map, even though it is learned in a purely unsupervised fashion. Remarkably, the two heat-maps (Fig. S1(a) and S1(b)) show different enrichment patterns. Clusters which are enriched in health states with higher APACHE scores often do not correspond exactly to clusters with a higher mortality risk. This suggests that traditional representations of physiologic values, such as the APACHE score, fail to fully use all complex multivariate relationships present in the ICU recordings and are not associated with dynamic mortality in a simple way.

(a) Current APACHE score
(b) Mortality risk in the next 24 hours
Figure S1: Heat-maps of enrichment in mortality risk in the next 24 hours as well as the current dynamic APACHE score, superimposed on the discrete 2D grid learned by VarTPSOM.

B.4 Visualizing health state trajectories in the ICU

To analyze the trend of the patient pathology, VarTPSOM induces trajectories on the 2D SOM grid which can be easily visualized. Fig. S2 shows randomly sampled patient trajectories obtained by our model. Trajectories ending in the death of the patient are shown in red, healthily dispatched patients are shown in green.

Figure S2: Randomly sampled VarTPSOM trajectories, from patients who expired at the end of the ICU stay as well as healthily dispatched patients. Superimposed is a heatmap which displays the cluster enrichment in the current APACHE score, from this model run. We observe that trajectories of dying patients often lie in different locations of the map than those of healthy patients, in particular in the regions enriched for high APACHE scores, which matches clinical intuition.

One of the main advantages of VarTPSOM over the traditional SOM algorithm is the use of soft assignments of data points to clusters, which results in a better ability to quantify uncertainty in the data. For visualizing health states in the ICU, this property is very important. In Fig. S3 we plot an example patient trajectory, for which 6 different time-steps (in temporal order) were chosen. Our model yields a soft centroid-based probability distribution which evolves over time and which allows estimation of likely discrete health states at any given point in time. For each time-step the distribution of probabilities is plotted using a heat-map, whereas the overall trajectory is plotted using a black line. The circle and cross indicate ICU admission and dispatch, respectively.

Figure S3: Probabilities over discrete patient health states for 6 different time-steps of the selected time series.