Temporal Phenotyping using Deep Predictive Clustering of Disease Progression

06/15/2020 · by Changhee Lee, et al.

Due to the wider availability of modern electronic health records, patient care data is often being stored in the form of time-series. Clustering such time-series data is crucial for patient phenotyping, anticipating patients' prognoses by identifying "similar" patients, and designing treatment guidelines that are tailored to homogeneous patient subgroups. In this paper, we develop a deep learning approach for clustering time-series data, where each cluster comprises patients who share similar future outcomes of interest (e.g., adverse events, the onset of comorbidities). To encourage each cluster to have homogeneous future outcomes, the clustering is carried out by learning discrete representations that best describe the future outcome distribution based on novel loss functions. Experiments on two real-world datasets show that our model achieves superior clustering performance over state-of-the-art benchmarks and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.


1 Introduction

Chronic diseases – such as cystic fibrosis and dementia – are heterogeneous in nature, with widely differing outcomes even in narrow patient subgroups. Disease progression manifests through a broad spectrum of clinical factors, collected as sequences of measurements in electronic health records, which gives rise to complex progression patterns among patients (Samal et al., 2011; Yoon et al., 2017). For example, cystic fibrosis evolves slowly, allowing for the development of comorbidities and bacterial infections and creating distinct responses to therapeutic interventions, which in turn makes survival and quality of life substantially different across patients (Ramos et al., 2017; Lee et al., 2019). Identifying patient subgroups with similar progression patterns can be advantageous for understanding such heterogeneous diseases. This allows clinicians to anticipate patients’ prognoses by comparing them to “similar” patients and to design treatment guidelines tailored to homogeneous subgroups (Zhang et al., 2019).

Figure 1: A conceptual illustration of our (real-time) clustering procedure. Here, a new patient is assigned over time to one of the four phenotypes based on the expected future event – either Event A or Event B – as new observations are collected.

Temporal clustering has recently been used as a data-driven framework to partition patients with time-series observations into subgroups. Recent research has typically focused either on finding fixed-length, low-dimensional representations (Zhang et al., 2019; Rusanov et al., 2016) or on modifying the similarity measure (Giannoula et al., 2018; Luong & Chandola, 2017), both in an attempt to apply existing clustering algorithms to time-series observations. However, clusters identified by these approaches are purely unsupervised – they do not account for patients’ observed outcomes (e.g., adverse events, the onset of comorbidities, etc.) – which leads to heterogeneous clusters whenever the clinical presentation of the disease differs even for patients with the same outcomes. Thus, a common prognosis in each cluster remains unknown, which can obscure the understanding of the underlying disease progression (Boudier et al., 2019; Wami et al., 2013). To overcome this limitation, we focus on predictive clustering (Blockeel et al., 2017), which combines predictions of future outcomes with clustering. More specifically, we aim at finding cluster assignments and centroids by learning discrete representations of time-series that best describe the future outcome distribution. By doing so, patients in the same cluster share similar future outcomes, which provides a prognostic value. Figure 1 provides a pictorial depiction of the clustering procedure.

In this paper, we propose an actor-critic approach for temporal predictive clustering, which we call AC-TPC (source code available at https://github.com/chl8856/AC_TPC). Our model consists of three networks – an encoder, a selector, and a predictor – and a set of centroid candidates. The key insight here is that we model temporal predictive clustering as learning discrete representations of the input time-series that best describe the future outcome distribution. More specifically, the encoder maps an input time-series into a continuous latent encoding; the selector assigns a cluster (i.e., maps to a discrete representation) to which the input belongs by taking the latent encoding as an input; and the predictor estimates the future outcome distribution conditioned on either the encoding or the centroid of the selected cluster (i.e., the selected discrete representation). Three contributions enable our model to achieve this goal. First, to encourage homogeneous future outcomes in each cluster, we define a clustering objective based on the Kullback-Leibler (KL) divergence between the predictor’s output given the time-series and that given the assigned centroid. Second, we transform the combinatorial problem of identifying clusters into iteratively solving two sub-problems: optimization of the cluster assignments and optimization of the centroids. Finally, we allow “back-propagation” through the sampling process of the selector by adopting actor-critic training (Konda & Tsitsiklis, 2000).

Throughout the experiments, we show significant performance improvements over the state-of-the-art clustering methods on two real-world medical datasets. To demonstrate the practical significance of our model, we consider a more realistic scenario where the future outcomes of interest are high-dimensional – that is, development of multiple comorbidities in the next year – and interpreting all possible combinations is intractable. Our experiments show that our model can identify meaningful clusters that can be translated into actionable information for clinical decision-making.

2 Problem Formulation

Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be random variables for an input feature and an output label (i.e., one or a combination of future outcome(s) of interest) with a joint distribution $p(x, y)$ (and marginal distributions $p(x)$ and $p(y)$), where $\mathcal{X}$ is the feature space and $\mathcal{Y}$ is the label space. Here, we focus our description on $C$-class classification tasks, i.e., $\mathcal{Y} = \{1, \dots, C\}$ (in the Supplementary Material, we discuss simple modifications for regression and $M$-dimensional binary classification tasks). We are given a time-series dataset $\mathcal{D} = \{(\mathbf{x}^n, \mathbf{y}^n)\}_{n=1}^{N}$ comprising sequences of realizations (i.e., observations) of the pair $(x, y)$ for $N$ patients. Here, $(\mathbf{x}^n, \mathbf{y}^n) = \{(x^n_t, y^n_t)\}_{t=1}^{T^n}$ is a sequence of observation pairs that correspond to patient $n$, and $t$ denotes the time stamp at which the observations are made. From this point forward, we omit the dependency on $n$ when it is clear from the context and denote the subsequence of the first $t$ observations as $\mathbf{x}_{1:t} = (x_1, \dots, x_t)$.

Our aim is to identify a set of $K$ predictive clusters, $\mathcal{C} = \{C(1), \dots, C(K)\}$, for time-series data. Each cluster consists of homogeneous data samples that can be represented by its centroid, based on a certain similarity measure. There are two main distinctions from the conventional notion of clustering. First, we treat subsequences of each time-series as data samples and focus on partitioning them into $\mathcal{C}$. Hence, we define a cluster as $C(k) = \{\mathbf{x}_{1:t} : k_t = k\}$ for $k \in \{1, \dots, K\}$, where $k_t$ denotes the cluster assignment for a given $\mathbf{x}_{1:t}$. This allows us to flexibly update the cluster assignment (in real time) to which a patient belongs as new observations are accrued over time. Second, we define the similarity measure with respect to the label distribution and associate it with clusters to provide a prognostic value. More specifically, we want the distribution of the output label for subsequences in each cluster to be homogeneous so that it can be well represented by the centroid of that cluster.

Let $c_t$ be a random variable for the cluster assignment – which depends on a given subsequence $\mathbf{x}_{1:t}$ – and $\bar{y}$ be a random variable for the output given a cluster. Then, this property of predictive clustering can be achieved by minimizing the Kullback-Leibler (KL) divergence $KL\big(p(y \,|\, \mathbf{x}_{1:t}) \,\|\, p(\bar{y} \,|\, c_t = k)\big)$, where $p(y \,|\, \mathbf{x}_{1:t})$ and $p(\bar{y} \,|\, c_t = k)$ are the label distributions conditioned on a subsequence $\mathbf{x}_{1:t}$ and a cluster assignment $c_t = k$, respectively. Note that the KL divergence achieves its minimum when the two distributions are equivalent.

Finally, we establish our goal as identifying a set of predictive clusters that optimizes the following objective:

$$\min_{\mathcal{C}} \ \mathbb{E}_{\mathbf{x}_{1:t}} \Big[ \, KL\big( p(y \,|\, \mathbf{x}_{1:t}) \, \big\| \, p(\bar{y} \,|\, c_t) \big) \Big] \qquad (1)$$

Unfortunately, the optimization problem in (1) is highly non-trivial. We need to estimate the objective function in (1) while solving a non-convex combinatorial problem of finding the optimal cluster assignments and cluster centroids.

3 Method: AC-TPC

Figure 2: The block diagram of AC-TPC. The red line indicates the procedure of estimating the label distribution via the sampling process (i.e., through the selected centroid), and the blue line indicates that of estimating it directly from the latent encoding.

To effectively estimate the objective function in (1), we introduce three networks – an encoder, a selector, and a predictor – and an embedding dictionary, as illustrated in Figure 2. These components together provide the cluster assignment and the corresponding centroid based on a given sequence of observations and enable us to estimate the probability density $p(y \,|\, \mathbf{x}_{1:t})$. More specifically, we define each component as follows:


  • The encoder, $f_{\theta}$, is an RNN (parameterized by $\theta$) that maps a (sub)sequence $\mathbf{x}_{1:t}$ of a time-series to a latent representation (i.e., encoding) $z_t \in \mathcal{Z}$, where $\mathcal{Z}$ is the latent space.

  • The selector, $h_{\phi}$, is a fully-connected network (parameterized by $\phi$) that provides a probabilistic mapping $\pi_t \in \Delta_{K-1}$ to a categorical distribution from which the cluster assignment $k_t$ is sampled.

  • The predictor, $g_{\psi}$, is a fully-connected network (parameterized by $\psi$) that estimates the label distribution given the encoding of a time-series or the centroid of a cluster.

  • The embedding dictionary, $\mathcal{E} = \{e_1, \dots, e_K\}$ where $e_k \in \mathcal{Z}$ for $k \in \{1, \dots, K\}$, is the set of cluster centroids lying in the latent space, each of which represents the corresponding cluster.

Here, $\Delta_{K-1}$ is the $(K-1)$-simplex that denotes the probability distribution of a $K$-dimensional categorical (class) variable.

At each time stamp $t$, the encoder maps an input (sub)sequence $\mathbf{x}_{1:t}$ into a latent encoding $z_t$. Then, based on the encoding $z_t$, the cluster assignment $k_t$ is drawn from a categorical distribution that is defined by the selector output, i.e., $k_t \sim \mathrm{Cat}(\pi_t)$ where $\pi_t = h_{\phi}(z_t)$. Once the assignment is chosen, we allocate the latent encoding to the embedding $e_{k_t}$ in the embedding dictionary $\mathcal{E}$. Since the allocated embedding corresponds to the centroid of the cluster to which $\mathbf{x}_{1:t}$ belongs, we can finally estimate the density $p(y \,|\, \mathbf{x}_{1:t})$ in (1) as the output of the predictor given the embedding $e_{k_t}$, i.e., $\bar{y}_t = g_{\psi}(e_{k_t})$.
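To make the data flow concrete, below is a minimal PyTorch sketch of these components (the class name, variable names, and hyper-parameters are ours for illustration, not those of the released implementation): a subsequence is encoded, a cluster assignment is sampled from the selector's categorical output, and the predictor is evaluated both at the selected centroid and at the encoding itself.

```python
import torch
import torch.nn as nn

class ACTPC(nn.Module):
    """Illustrative sketch of the AC-TPC components (not the released code)."""
    def __init__(self, x_dim, z_dim, y_dim, n_clusters):
        super().__init__()
        # Encoder: RNN mapping a (sub)sequence x_{1:t} to a latent encoding z_t.
        self.encoder = nn.LSTM(x_dim, z_dim, batch_first=True)
        # Selector: fully-connected network giving cluster-assignment probabilities pi_t.
        self.selector = nn.Sequential(nn.Linear(z_dim, 50), nn.ReLU(),
                                      nn.Linear(50, n_clusters), nn.Softmax(dim=-1))
        # Predictor: fully-connected network mapping an encoding or a centroid
        # to a label distribution.
        self.predictor = nn.Sequential(nn.Linear(z_dim, 50), nn.ReLU(),
                                       nn.Linear(50, y_dim), nn.Softmax(dim=-1))
        # Embedding dictionary: one centroid e_k per cluster, lying in the latent space.
        self.embeddings = nn.Parameter(torch.randn(n_clusters, z_dim))

    def forward(self, x_seq):
        # x_seq: (batch, t, x_dim) holds subsequences x_{1:t}.
        _, (h_t, _) = self.encoder(x_seq)
        z_t = h_t[-1]                                           # latent encoding z_t
        pi_t = self.selector(z_t)                               # assignment probabilities pi_t
        k_t = torch.distributions.Categorical(pi_t).sample()    # sampled assignment k_t
        e_kt = self.embeddings[k_t]                             # centroid of the selected cluster
        y_bar = self.predictor(e_kt)                            # label distribution given the centroid
        y_hat = self.predictor(z_t)                             # label distribution given the encoding
        return z_t, pi_t, k_t, y_bar, y_hat
```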

3.1 Loss Functions

In this subsection, we define loss functions to achieve our objective in (1); the details of how we train our model will be discussed in the following subsection.

Predictive Clustering Loss: Since finding the cluster assignment of a given sequence is a probabilistic problem due to the sampling process, the objective function in (1) must be defined as an expectation over the cluster assignment. Thus, we can estimate solving the objective problem in (1) as minimizing the following loss function:

$$\mathcal{L}_1 = \mathbb{E}\Big[\, \ell_1\big(y_t, \bar{y}_t\big) \Big], \quad \text{where} \quad \ell_1(y_t, \bar{y}_t) = -\sum_{c=1}^{C} y_{t,c} \log \bar{y}_{t,c} \qquad (2)$$

and $\bar{y}_t = g_{\psi}(e_{k_t})$ is the predictor's output given the centroid of the selected cluster. Here, we slightly abuse the notation and denote $y_t$ as the one-hot encoding of the true label, and $y_{t,c}$ and $\bar{y}_{t,c}$ indicate the $c$-th components of $y_t$ and $\bar{y}_t$, respectively. It is worth highlighting that minimizing $\mathcal{L}_1$ is equivalent to minimizing the KL divergence in (1), since the former term of the KL divergence is independent of our optimization procedure.

One critical question that may arise is how to avoid trivial solutions in this unsupervised setting of identifying the cluster assignments and the centroids (Yang et al., 2017). For example, all the embeddings in $\mathcal{E}$ may collapse into a single point, or the selector may simply assign equal probability to all the clusters regardless of the input sequence. In both cases, our model will fail to correctly estimate the label distribution and, thus, end up with a trivial solution. To address this issue, we introduce two auxiliary loss functions. It is worth highlighting that these loss functions are not subject to the sampling process, so their gradients can be back-propagated directly.

Sample-Wise Entropy of Cluster Assignment: To encourage sparse cluster assignments such that the selector ultimately selects one dominant cluster for each sequence, we introduce the sample-wise entropy of cluster assignment, which is given as

$$\mathcal{L}_2 = \mathbb{E}\Big[ -\sum_{k=1}^{K} \pi_{t,k} \log \pi_{t,k} \Big] \qquad (3)$$

where $\pi_t = h_{\phi}(f_{\theta}(\mathbf{x}_{1:t}))$ and $\pi_{t,k}$ denotes its $k$-th component. The sample-wise entropy achieves its minimum when $\pi_t$ becomes a one-hot vector.

Embedding Separation Loss: To prevent the embeddings in $\mathcal{E}$ from collapsing into a single point, we define a loss function that encourages the embeddings to represent label distributions, i.e., $g_{\psi}(e_k)$ for $k \in \{1, \dots, K\}$, that differ from each other:

$$\mathcal{L}_3 = -\sum_{k=1}^{K} \sum_{k' \neq k} \ell_1\big(g_{\psi}(e_k), g_{\psi}(e_{k'})\big) \qquad (4)$$

where $\ell_1$ is reused to quantify the distance between the label distributions conditioned on each cluster. We minimize (4) when updating the embedding vectors in $\mathcal{E}$.
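As a rough illustration of how these three losses could be computed, here is a sketch under our own naming and tensor-shape assumptions (in particular, the exact functional form of the embedding separation term is our reading of the description above):

```python
import torch

def predictive_clustering_loss(y_true, y_bar, eps=1e-8):
    # Cross-entropy between the true (one-hot) label and the predictor's output
    # given the selected centroid; minimizing it matches minimizing the KL
    # divergence in (1) up to a constant.
    return -(y_true * torch.log(y_bar + eps)).sum(dim=-1).mean()

def sample_wise_entropy(pi, eps=1e-8):
    # Entropy of the selector's assignment distribution; minimizing it pushes
    # pi towards a one-hot vector (one dominant cluster per sample).
    return -(pi * torch.log(pi + eps)).sum(dim=-1).mean()

def embedding_separation_loss(y_bar_per_cluster, eps=1e-8):
    # y_bar_per_cluster: (K, C) label distributions predicted from each centroid.
    # Pairwise cross-entropy grows as two distributions diverge, so we return its
    # negated average: minimizing this loss pushes the per-cluster label
    # distributions apart and keeps the centroids from collapsing.
    K = y_bar_per_cluster.shape[0]
    pairwise = 0.0
    for k in range(K):
        for kp in range(K):
            if k != kp:
                ce = -(y_bar_per_cluster[k] * torch.log(y_bar_per_cluster[kp] + eps)).sum()
                pairwise = pairwise + ce
    return -pairwise / (K * (K - 1))
```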

(5)

3.2 Optimization

The optimization problem in (1) is a non-convex combinatorial problem because it comprises not only minimizing the KL divergence but also finding the optimal cluster assignments and centroids. Hence, we propose an optimization procedure that iteratively solves two subproblems: i) optimizing the three networks – the encoder, selector, and predictor – while fixing the embedding dictionary and ii) optimizing the embedding dictionary while fixing the three networks. Pseudo-code of AC-TPC can be found in the Supplementary Material.

3.2.1 Optimizing the Three Networks

Finding predictive clusters incorporates a sampling process which is non-differentiable. Thus, to enable “back-propagation”, we utilize the training of actor-critic models (Konda & Tsitsiklis, 2000). More specifically, we view the combination of the encoder ($f_{\theta}$) and the selector ($h_{\phi}$) as the “actor”, parameterized by $(\theta, \phi)$, and the predictor ($g_{\psi}$) as the “critic”. The critic takes as input the output of the actor (i.e., the cluster assignment) and estimates its value based on the sample-wise predictive clustering loss (i.e., $\ell_1$) given the chosen cluster. This, in turn, guides the actor to change the distribution over clusters so as to minimize this loss. Thus, it is important for the critic to perform well on the updated output of the actor, and for the actor to perform well on the updated loss estimation. As such, the parameters of the actor and the critic need to be updated iteratively.

With the embedding dictionary fixed (we thus omit the dependency on $\mathcal{E}$), we train the actor, i.e., the encoder and the selector, by minimizing a combination of the predictive clustering loss and the entropy of cluster assignments, $\mathcal{L}_1 + \alpha \mathcal{L}_2$, where $\alpha$ is a coefficient chosen to balance the two losses. To derive the gradient of this loss with respect to $(\theta, \phi)$, we utilize ideas from actor-critic models (Konda & Tsitsiklis, 2000), as given in (S.1); please refer to the Supplementary Material for the detailed derivation. Note that since no sampling process is involved in $\mathcal{L}_2$, its gradient can be derived directly.

Iteratively with training the actor, we train the critic, i.e., the predictor, by minimizing the predictive clustering loss $\mathcal{L}_1$. Note that since the critic is independent of the sampling process, its gradient with respect to $\psi$ can be back-propagated directly.
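A minimal sketch of one such iteration is given below, assuming the ACTPC module sketched in Section 3 (the REINFORCE-style weighting of the selector's log-probability by the per-sample loss is our reading of the actor-critic update; the optimizer setup is illustrative):

```python
import torch

def actor_critic_step(model, actor_opt, critic_opt, x_seq, y_true, alpha=1.0, eps=1e-8):
    """One iteration of Section 3.2.1 with the embedding dictionary held fixed.
    actor_opt covers the encoder and selector; critic_opt covers the predictor."""
    # Critic (predictor) update: ordinary gradient of the predictive clustering loss.
    _, _, _, y_bar, _ = model(x_seq)
    ce = -(y_true * torch.log(y_bar + eps)).sum(dim=-1)   # per-sample loss l_1
    critic_opt.zero_grad()
    ce.mean().backward()
    critic_opt.step()

    # Actor (encoder + selector) update: the sampling step blocks ordinary
    # back-propagation, so the per-sample loss acts as a "return" weighting the
    # log-probability of the sampled assignment, plus the entropy penalty.
    _, pi_t, k_t, y_bar, _ = model(x_seq)
    ce = -(y_true * torch.log(y_bar + eps)).sum(dim=-1)
    log_pi_k = torch.log(pi_t.gather(1, k_t.unsqueeze(1)).squeeze(1) + eps)
    entropy = -(pi_t * torch.log(pi_t + eps)).sum(dim=-1)
    actor_loss = (ce.detach() * log_pi_k).mean() + alpha * entropy.mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```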

3.2.2 Optimizing the Cluster Centroids

Now, with the parameters of the three networks fixed (we thus omit the dependency on $\theta$, $\phi$, and $\psi$), we update the embeddings in $\mathcal{E}$ by minimizing a combination of the predictive clustering loss and the embedding separation loss, $\mathcal{L}_1 + \beta \mathcal{L}_3$, where $\beta$ is a coefficient chosen to balance the two losses.
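A corresponding sketch of the centroid update, again assuming the ACTPC module and the loss helpers sketched above (emb_opt is assumed to contain only the embedding dictionary):

```python
import torch

def centroid_step(model, emb_opt, x_seq, y_true, beta=1.0, eps=1e-8):
    """One iteration of Section 3.2.2: update only the embedding dictionary
    while the encoder, selector, and predictor stay fixed."""
    _, _, _, y_bar, _ = model(x_seq)
    # Predictive clustering loss evaluated at the selected centroids.
    l1 = -(y_true * torch.log(y_bar + eps)).sum(dim=-1).mean()
    # Embedding separation loss over the label distributions of all centroids.
    y_bar_all = model.predictor(model.embeddings)          # shape (K, C)
    l3 = embedding_separation_loss(y_bar_all)
    emb_opt.zero_grad()
    (l1 + beta * l3).backward()
    emb_opt.step()
```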

3.2.3 Initializing AC-TPC via Pre-Training

Since we transform the combinatorial optimization problem in (1) into iteratively solving two sub-problems, initialization is crucial to achieve better optimization; a similar concern is addressed in (Yang et al., 2017).

Therefore, we initialize our model based on the following procedure. First, we pre-train the encoder and the predictor by minimizing the following loss function based on the predicted label distribution given the latent encodings of input sequences, i.e., $\hat{y}_t = g_{\psi}(z_t)$:

$$\mathcal{L}_{\text{pre}} = \mathbb{E}\Big[\, \ell_1\big(y_t, \hat{y}_t\big) \Big] \qquad (6)$$

Minimizing (6) encourages the latent encoding to be enriched with information for accurately predicting the label distribution. Then, we perform $K$-means (other clustering methods can also be applied) on the learned representations to initialize the embeddings $\{e_k\}_{k=1}^{K}$ and the cluster assignments. Finally, we pre-train the selector by minimizing the cross-entropy that treats the initialized cluster assignments as the true clusters.
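A hedged sketch of this initialization, using scikit-learn's K-means (function and optimizer names are ours; pretrain_opt is assumed to cover the encoder and predictor, selector_opt only the selector, and the data loader is assumed to iterate in a fixed order so that the K-means labels stay aligned with the batches):

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def initialize(model, pretrain_opt, selector_opt, data_loader, n_clusters, epochs=10, eps=1e-8):
    # (i) Pre-train encoder + predictor on the cross-entropy in (6).
    for _ in range(epochs):
        for x_seq, y_true in data_loader:
            _, _, _, _, y_hat = model(x_seq)
            loss = -(y_true * torch.log(y_hat + eps)).sum(dim=-1).mean()
            pretrain_opt.zero_grad()
            loss.backward()
            pretrain_opt.step()

    # (ii) K-means on the latent encodings to initialize centroids and assignments.
    with torch.no_grad():
        z_all = torch.cat([model(x_seq)[0] for x_seq, _ in data_loader])
    km = KMeans(n_clusters=n_clusters).fit(z_all.cpu().numpy())
    model.embeddings.data.copy_(torch.as_tensor(km.cluster_centers_, dtype=torch.float32))
    init_labels = torch.as_tensor(km.labels_, dtype=torch.long)

    # (iii) Pre-train the selector with the K-means assignments as targets.
    for _ in range(epochs):
        offset = 0
        for x_seq, _ in data_loader:
            _, pi_t, _, _, _ = model(x_seq)
            target = init_labels[offset:offset + pi_t.shape[0]]
            offset += pi_t.shape[0]
            loss = F.nll_loss(torch.log(pi_t + eps), target)
            selector_opt.zero_grad()
            loss.backward()
            selector_opt.step()
```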

4 Related Work

Temporal clustering, also known as time-series clustering, is a process of unsupervised partitioning of time-series data into clusters such that homogeneous time-series are grouped together based on a certain similarity measure. Temporal clustering is challenging because i) the data is often high-dimensional – it consists of sequences not only with high-dimensional features but also with many time points – and ii) defining a proper similarity measure for time-series is not straightforward since it is often highly sensitive to distortions (Ratanamahatana et al., 2005). To address these challenges, there have been various attempts to find a good representation with reduced dimensionality or to define a proper similarity measure for time-series (Aghabozorgi et al., 2015).

Recently, (Baytas et al., 2017) and (Madiraju et al., 2018) proposed temporal clustering methods that utilize low-dimensional representations learned by RNNs. These works are motivated by the success of applying deep neural networks to find “clustering-friendly” latent representations for clustering static data (Xie et al., 2017; Yang et al., 2017). In particular, Baytas et al. (2017) utilized a modified LSTM auto-encoder to find latent representations that effectively summarize the input time-series and conducted $K$-means on top of the learned representations as an ad-hoc process. Similarly, (Madiraju et al., 2018) proposed a bidirectional-LSTM auto-encoder that jointly optimizes the reconstruction loss for dimensionality reduction and the clustering objective. However, these methods do not associate a target property with the clusters and, thus, provide little prognostic value about the underlying disease progression.

Our work is most closely related to SOM-VAE (Fortuin et al., 2019). This method jointly optimizes a static variational auto-encoder (VAE), which finds latent representations of the input features, and a self-organizing map (SOM), which maps the latent representations into more interpretable discrete representations, i.e., the embeddings. However, there are three key differences between our work and SOM-VAE. First, SOM-VAE aims at minimizing a reconstruction loss that is specified as the mean squared error between the original input and the input reconstructed from the corresponding embedding. Thus, similar to the aforementioned methods, SOM-VAE does not associate future outcomes of interest with clusters. In contrast, we focus on minimizing the KL divergence between the outcome distribution given the original input sequence and that given the corresponding embedding, to build an association between future outcomes of interest and clusters. Second, to overcome the non-differentiability caused by the sampling process (that is, mapping the latent representation to the embeddings), (Fortuin et al., 2019) applies the gradient copying technique proposed by (van den Oord et al., 2017), while we utilize the training of actor-critic models (Konda & Tsitsiklis, 2000). Finally, while we flexibly model time-series using an LSTM, SOM-VAE handles time-series by integrating a Markov model in the latent representations. This can be a restrictive assumption, especially in clinical settings where a patient’s medical history is informative for predicting future clinical outcomes (Ranganath et al., 2016).

5 Experiments

In this section, we provide a set of experiments using two real-world time-series datasets. We iteratively update the three networks – the encoder, selector, and predictor – and the embedding dictionary as described in Section 3.2. For the network architecture, we constructed the encoder utilizing a single-layer LSTM (Hochreiter & Schmidhuber, 1997) with 50 nodes and constructed the selector and predictor utilizing two-layer fully-connected networks with 50 nodes in each layer, respectively. The parameters are initialized by Xavier initialization (Glorot & Bengio, 2010) and optimized via the Adam optimizer (Kingma & Ba, 2014) with learning rate of and keep probability . We chose the balancing coefficients utilizing a grid search that achieves the minimum validation loss in (2); the effect of the different loss functions is further investigated in the experiments. All the results are reported using 5 random 64/16/20 train/validation/test splits.

5.1 Real-World Datasets

We conducted experiments to investigate the performance of AC-TPC on two real-world medical datasets; detailed statistics of each dataset can be found in the Supplementary Material.

UK Cystic Fibrosis registry (UKCF; https://www.cysticfibrosis.org.uk): This dataset records annual follow-ups for 5,171 adult patients (aged 18 years or older) enrolled in the UK CF registry over the period from 2008 to 2015, with a total of 25,012 hospital visits. Each patient is associated with 89 variables (i.e., 11 static and 78 time-varying features), including information on demographics and genetic mutations, bacterial infections, lung function scores, therapeutic managements, and diagnoses of comorbidities. We set the development of different comorbidities in the next year as the label of interest at each time stamp.

Alzheimer’s Disease Neuroimaging Initiative (ADNI; https://adni.loni.usc.edu): This dataset consists of 1,346 patients in the Alzheimer’s disease study, with a total of 11,651 hospital visits, and tracks disease progression via follow-up observations at 6-month intervals. Each patient is associated with 21 variables (i.e., 5 static and 16 time-varying features), including information on demographics, biomarkers of brain function, and cognitive test results. We set predictions on the three diagnostic groups – normal brain functioning, mild cognitive impairment, and Alzheimer’s disease – as the label of interest at each time stamp.

5.2 Benchmarks

We compare AC-TPC with clustering methods ranging from conventional approaches based on $K$-means to state-of-the-art approaches based on deep neural networks. All the benchmarks compared in the experiments are tailored to incorporate time-series data as described below:

Dynamic time warping followed by $K$-means: Dynamic time warping (DTW) is utilized to quantify the pairwise distance between two variable-length sequences and, then, $K$-means is applied (KM-DTW); a tslearn-based sketch is given at the end of this subsection.

$K$-means with deep neural networks: To handle variable-length time-series data, we utilize our encoder and predictor trained based on (6) for fixed-length dimensionality reduction. Then, we apply $K$-means on the latent encodings (KM-E2P (Z)) and on the predicted label distributions (KM-E2P (Y)), respectively.

Extensions of DCN (Yang et al., 2017): Since the DCN is designed for static data, we replace its static auto-encoder with a sequence-to-sequence network to incorporate time-series data (DCN-S2S). This extension is representative of recent deep learning approaches for clustering both static data (Xie et al., 2017; Yang et al., 2017) and time-series data (Baytas et al., 2017; Madiraju et al., 2018), since these methods are built upon the same concept – that is, applying deep networks for dimensionality reduction before conducting conventional clustering methods, e.g., $K$-means. To associate the clusters with the label distribution, we also compare a DCN whose static auto-encoder is replaced with our encoder and predictor (DCN-E2P), which focuses the dimensionality reduction on preserving information for label prediction.

SOM-VAE (Fortuin et al., 2019): We compare with SOM-VAE – though this method is oriented towards visualizing the input – since it naturally clusters time-series data (SOM-VAE). In addition, we compare with a variation of SOM-VAE in which the decoder is replaced with our predictor to find embeddings that capture information for predicting the label (SOM-VAE-P). For both cases, we set the dimension of SOM to .

It is worth highlighting that the label information is provided for training DCN-E2P, KM-E2P, and SOM-VAE-P while the label information is not provided for training KM-DTW, DCN-S2S, and SOM-VAE. Please refer to the Supplementary Material for the summary of major components of the tested benchmarks and the implementation details.
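For reference, the KM-DTW baseline can be sketched with the tslearn package referenced in the Supplementary Material (the toy data and hyper-parameters below are purely illustrative):

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans

# Toy example: 20 univariate sequences of length 15; variable-length series
# would first be padded or handled with tslearn's utilities.
rng = np.random.default_rng(0)
series = rng.normal(size=(20, 15, 1))

# K-means with the DTW metric, roughly corresponding to the KM-DTW benchmark.
km_dtw = TimeSeriesKMeans(n_clusters=3, metric="dtw", random_state=0)
labels = km_dtw.fit_predict(series)
print(labels)
```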

5.3 Performance Metrics

Clustering Performance: We applied the following three standard metrics for evaluating clustering performance when the ground-truth cluster label is available: the purity score, normalized mutual information (NMI) (Vinh et al., 2010), and the adjusted Rand index (ARI) (Hubert & Arabie, 1985). More specifically, the purity score assesses how homogeneous each cluster is (it ranges from 0 to 1, where 1 means each cluster consists of a single class), the NMI is an information-theoretic measure of how much information is shared between the clusters and the labels, adjusted for the number of clusters (it ranges from 0 to 1, where 1 indicates a perfect clustering), and the ARI is a corrected-for-chance version of the Rand index, which measures the percentage of correct cluster assignments (it ranges from -1 to 1, where 1 indicates a perfect clustering and 0 corresponds to random assignment).

When the ground-truth label is not available, we utilize the average Silhouette index (SI) (Rousseeuw, 1987), which measures how similar a member is to its own cluster (homogeneity within a cluster) compared to other clusters (heterogeneity across clusters). Formally, the SI for a subsequence assigned to cluster $C(k)$ is given as $SI = \frac{b - a}{\max(a, b)}$, where $a$ is the average distance to the other members of $C(k)$ and $b$ is the smallest average distance to the members of any other cluster. Here, we used the L1-distance between the ground-truth labels of the future outcomes of interest, since our goal is to group input subsequences with similar future outcomes.
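For reference, these metrics can be computed with scikit-learn; the purity helper and the toy labels below are our own illustration:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(y_true, y_pred):
    # Fraction of samples whose cluster's majority class matches their own class.
    cm = contingency_matrix(y_true, y_pred)
    return cm.max(axis=0).sum() / cm.sum()

# Toy ground-truth classes and cluster assignments.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])
print(purity_score(y_true, y_pred))                   # purity
print(normalized_mutual_info_score(y_true, y_pred))   # NMI
print(adjusted_rand_score(y_true, y_pred))            # ARI

# Average Silhouette index with the L1 distance on the (binary) outcome labels,
# as used when no ground-truth cluster label is available.
outcomes = np.array([[0, 1], [0, 1], [1, 0], [1, 0], [1, 1], [1, 1]])
print(silhouette_score(outcomes, y_pred, metric="manhattan"))
```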

Prediction Performance: To assess the prediction performance of the identified predictive clusters, we utilized both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), based on the label predictions of each cluster and the ground-truth binary labels on the future outcomes of interest. Note that the prediction performance is available only for the benchmarks that incorporate the label information during training.

5.4 Clustering Performance

We start with a simple scenario where the true class (i.e., the ground-truth cluster label) is available and the number of classes is tractable. In particular, we define the true class based on the binary labels for the development of three common comorbidities of cystic fibrosis – diabetes, ABPA, and intestinal obstruction – in the next year for the UKCF dataset, and based on the three mutually exclusive diagnostic groups for the ADNI dataset. We compare AC-TPC against the aforementioned benchmarks with respect to the clustering and prediction performance in Table 1.

Dataset Method Purity NMI ARI AUROC AUPRC
UKCF KM-DTW 0.573±0.01 0.010±0.01 0.014±0.01 N/A N/A
KM-E2P (Z) 0.719±0.01 0.211±0.01 0.107±0.01 0.726±0.01 0.425±0.02
KM-E2P (Y) 0.751±0.01 0.325±0.01 0.440±0.02 0.807±0.00 0.514±0.01
DCN-S2S 0.607±0.06 0.059±0.08 0.063±0.09 N/A N/A
DCN-E2P 0.751±0.02 0.275±0.02 0.184±0.01 0.772±0.03 0.487±0.03
SOM-VAE 0.573±0.01 0.006±0.00 0.006±0.01 N/A N/A
SOM-VAE-P 0.638±0.04 0.201±0.05 0.283±0.17 0.754±0.05 0.331±0.07
Proposed 0.807±0.01 0.463±0.01 0.602±0.01 0.843±0.01 0.605±0.01
ADNI KM-DTW 0.566±0.02 0.019±0.02 0.006±0.02 N/A N/A
KM-E2P (Z) 0.736±0.03 0.249±0.02 0.230±0.03 0.707±0.01 0.509±0.01
KM-E2P (Y) 0.776±0.05 0.264±0.07 0.317±0.11 0.756±0.04 0.503±0.04
DCN-S2S 0.567±0.02 0.005±0.00 0.000±0.01 N/A N/A
DCN-E2P 0.749±0.06 0.261±0.05 0.215±0.06 0.721±0.03 0.509±0.03
SOM-VAE 0.566±0.02 0.040±0.06 0.011±0.02 N/A N/A
SOM-VAE-P 0.586±0.06 0.085±0.08 0.038±0.06 0.597±0.10 0.376±0.05
Proposed 0.786±0.03 0.285±0.04 0.330±0.06 0.768±0.02 0.515±0.02
Table 1: Performance comparison on the UKCF and ADNI datasets.

As shown in Table 1, AC-TPC achieved performance gains over all the tested benchmarks in terms of both clustering and prediction performance – most of the improvements being statistically significant – for both datasets. Importantly, the clustering methods that do not account for the future outcomes of interest – i.e., KM-DTW, DCN-S2S, and SOM-VAE – identified clusters that provide little prognostic value on the future outcomes (note that the true class is derived from the future outcomes of interest). This is clearly shown by ARI values near 0, which indicate that the identified clusters are no better than random assignments. Therefore, sequences that are similar with respect to latent representations tailored for reconstruction, or with respect to the shape-based DTW measure, can have very different outcomes.

Figure 3: The purity score, NMI, and ARI (mean and 95% confidence interval) for the UKCF dataset with various numbers of clusters $K$. (a) The averaged purity score. (b) The averaged NMI. (c) The averaged ARI.

In Figure 3, we further investigate the purity score, NMI, and ARI by varying the number of clusters $K$ on the UKCF dataset in the same setting as stated above. Here, the three methods that do not use label information – i.e., KM-DTW, DCN-S2S, and SOM-VAE – are excluded for better visualization. As we can see in Figure 3, our model rarely incurs a performance loss in either NMI or ARI, while the benchmarks (except for SOM-VAE-P) showed a significant decrease in performance as $K$ increased. This is because the number of clusters identified by AC-TPC (i.e., the number of activated clusters, where we define a cluster as activated if at least one sample is assigned to it) remained the same most of the time, while the DCN-based methods identified exactly $K$ clusters (due to the $K$-means). Since the NMI and ARI are adjusted for the number of clusters, a smaller number of identified clusters yields, all else being equal, a higher performance. In contrast, while our model achieved the same purity score across values of $K$, the benchmarks showed improved purity as $K$ increased, since the purity score does not penalize having many clusters. This is an important property of AC-TPC: we do not need to know the number of clusters a priori, which is a common practical challenge when applying conventional clustering methods (e.g., $K$-means).

The performance gain of our model over SOM-VAE-P (and our analysis is the same for SOM-VAE) comes from two possible sources: i) SOM-VAE-P mainly focuses on visualizing the input with the SOM, which makes both the encoder and the embeddings less flexible – this is why it performed better with higher $K$ – and ii) the Markov property can be too strict for time-series data, especially in clinical settings where a patient’s medical history is informative for predicting future clinical outcomes (Ranganath et al., 2016).

Figure 4: AUROC, AUPRC, and average SI (mean and 95% confidence interval) and the number of activated clusters with various $K$. (a) AUROC. (b) AUPRC. (c) Average SI.

5.5 Multiple Future Outcomes – a Practical Scenario

In this experiment, we focus on a more practical scenario where the future outcome of interest is high-dimensional and, thus, the number of classes based on all possible combinations of future outcomes becomes intractable. Suppose that we are interested in the development of $M$ comorbidities in the next year, whose possible combinations grow exponentially (i.e., $2^M$). Interpreting such a large number of patient subgroups would be a daunting task which hinders the understanding of the underlying disease progression. Since different comorbidities may share common driving factors (Ronan et al., 2017), we expect our model to identify a much smaller number of underlying (latent) clusters that govern the development of comorbidities. Here, to incorporate the $M$ comorbidities (i.e., binary labels), we redefine the output space as $\mathcal{Y} = \{0, 1\}^M$ and modify the predictor and the loss functions accordingly, as sketched below.
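Concretely, a hedged sketch of this modification (the predictor input width, the number of comorbidities, and the names below are illustrative assumptions): the softmax output is replaced by per-comorbidity sigmoids and $\ell_1$ by a sum of element-wise binary cross-entropy terms.

```python
import torch
import torch.nn as nn

M = 20          # hypothetical number of tracked comorbidities
Z_DIM = 50      # assumed latent dimension, matching the architecture in Section 5

# Predictor head for M binary labels: independent sigmoids instead of a softmax.
predictor = nn.Sequential(nn.Linear(Z_DIM, 50), nn.ReLU(), nn.Linear(50, M), nn.Sigmoid())

def multilabel_loss(y_true, y_bar, eps=1e-8):
    # Sum of element-wise binary cross-entropy terms, assuming each comorbidity
    # is conditionally independent given the input sequence.
    return -(y_true * torch.log(y_bar + eps) +
             (1.0 - y_true) * torch.log(1.0 - y_bar + eps)).sum(dim=-1).mean()
```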

Figure 5: Clusters with a high risk of developing diabetes.

We identified clusters of patients based on the next-year development of different comorbidities in the UKCF dataset and report five of them in Figure 5 – Clusters 0, 5, 7, 8, and 10 – with the frequency of developing important comorbidities in the next year. Here, we selected the clusters that have the highest risk of developing diabetes in the next year, and the frequency is calculated in a cluster-specific fashion using the true labels. A full list of clusters and comorbidity frequencies can be found in the Supplementary Material. Although all these clusters displayed a high risk of diabetes, the frequency of other co-occurring comorbidities was significantly different across the clusters. In particular, around 89% of the patients in Cluster 5 experienced asthma in the next year, while this was the case for less than 3% of the patients in the other clusters. Interestingly, “leukotriene” – a medicine commonly used to manage asthma – and “FEV% predicted” – a measure of lung function – were the two input features that differed most between patients in Cluster 5 and those in the other clusters. We observed similar findings in Cluster 7 with ABPA, Cluster 8 with liver disease, and Cluster 10 with osteopenia. Therefore, by grouping patients who are likely to develop a similar set of comorbidities, our method identified clusters that can be translated into actionable information for clinical decision-making.

5.6 Trade-Off between Clustering and Prediction

In predictive clustering, the trade-off between the clustering performance (for better interpretability) – which quantifies how homogeneous the data samples are within each cluster and how heterogeneous they are across clusters with respect to the future outcomes of interest – and the prediction performance is a common issue. The most important parameter governing this trade-off is the number of clusters. More specifically, increasing the number of clusters gives the predictive clusters higher diversity to represent the output distribution and, thus, increases the prediction performance while decreasing the clustering performance. An extreme example is having as many clusters as data samples, which makes the identified clusters fully individualized; as a consequence, each cluster loses interpretability as it no longer groups similar data samples.

To highlight this trade-off, we conduct experiments under the same experimental setup as in Section 5.5. For the performance measures, we utilized the AUROC and AUPRC to assess the prediction performance, and the average SI to assess the clustering performance. To control the number of activated clusters, we fixed the balancing coefficients $\alpha$ and $\beta$ (since the embedding separation loss in (4) controls the activation of clusters) and reported the performance while increasing the number of possible clusters $K$, i.e., the size of the embedding dictionary.

As can be seen in Figure 4, the prediction performance increased with an increasing number of identified clusters, due to the higher diversity available to represent the label distribution, while the identified clusters became less interpretable. That is, the cohesion and separation among clusters become ambiguous, as shown by the low average SI. On the other hand, when we set $\beta$ based on the validation loss in (2), our method consistently identified a similar number of clusters across values of $K$ – 13.8 on average – in a data-driven fashion and provided slightly reduced prediction performance with significantly better interpretability, i.e., a higher average SI.

6 Conclusion

In this paper, we introduced AC-TPC, a deep learning approach for predictive clustering of time-series data. We defined novel loss functions to encourage each cluster to have homogeneous future outcomes (e.g., adverse events, the onset of comorbidities, etc.) and designed optimization procedures to avoid trivial solutions in identifying cluster assignments and the centroids. Throughout the experiments on two real-world datasets, we showed that our model achieves superior clustering performance over state-of-the-art methods and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.

Acknowledgements

The authors would like to thank the reviewers for their helpful comments. This work was supported by the National Science Foundation (NSF grants 1524417 and 1722516), the US Office of Naval Research (ONR), and the UK Cystic Fibrosis Trust. We thank Dr. Janet Allen (Director of Strategic Innovation, UK Cystic Fibrosis Trust) for the vision and encouragement. We thank Rebecca Cosgriff and Elaine Gunn for the help with data access, extraction and analysis. We also thank Prof. Andres Floto and Dr. Tomas Daniels, our collaborators, for the very helpful clinical discussions.

References

  • Aghabozorgi et al. (2015) Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. Time-series clustering – a decade review. Information Systems, 53:16–38, May 2015.
  • Baytas et al. (2017) Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. Patient subtyping via time-aware lstm networks. In Proceedings of the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017), 2017.
  • Blockeel et al. (2017) Blockeel, H., Dzeroski, S., Struyf, J., and Zenko, B. Predictive Clustering. Springer New York, 2017.
  • Boudier et al. (2019) Boudier, A., Chanoine, S., Accordini, S., Anto, J. M., Basagaña, X., Bousquet, J., Demoly, P., Garcia-Aymerich, J., Gormand, F., Heinrich, J., Janson, C., Künzli, N., Matran, R., Pison, C., Raherison, C., Sunyer, J., Varraso, R., Jarvis, D., Leynaert, B., Pin, I., and Siroux, V. Data-driven adult asthma phenotypes based on clinical characteristics are associated with asthma outcomes twenty years later. Allergy, 74(5):953–963, May 2019.
  • Fortuin et al. (2019) Fortuin, V., Hüser, M., Locatello, F., Strathmann, H., and Rätsch, G. SOM-VAE: Interpretable discrete representation learning on time series. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.
  • Giannoula et al. (2018) Giannoula, A., Gutierrez-Sacristán, A., Bravo, A., Sanz, F., and Furlong, L. I. Identifying temporal patterns in patient disease trajectories using dynamic time warping: A population-based study. Scientific Reports, 8(4216):1–14, March 2018.
  • Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
  • Hubert & Arabie (1985) Hubert, L. and Arabie, P. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proceedings of the 13th Conference on Neural Information Processing Systems (NIPS 2000), 2000.
  • Lee et al. (2019) Lee, C., Yoon, J., and van der Schaar, M. Dynamic-deephit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Transactions on Biomedical Engineering, April 2019.
  • Luong & Chandola (2017) Luong, D. T. A. and Chandola, V. A k-means approach to clustering disease progressions. In Proceedings of the 5th IEEE International Conference on Healthcare Informatics (ICHI), 2017.
  • Madiraju et al. (2018) Madiraju, N. S., Sadat, S. M., Fisher, D., and Karimabadi, H. Deep temporal clustering: Fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059, 2018.
  • Ramos et al. (2017) Ramos, K. J., Quon, B. S., Heltshe, S. L., Mayer-Hamblett, N., Lease, E. D., Aitken, M. L., Weiss, N. S., and Goss, C. H. Heterogeneity in survival in adult patients with cystic fibrosis with FEV1 < 30% of predicted in the United States. Chest, 151(6):1320–1328, June 2017.
  • Ranganath et al. (2016) Ranganath, R., Perotte, A., Elhadad, N., and Blei, D. Deep survival analysis. In Proceedings of the 1st Machine Learning for Healthcare Conference (MLHC 2016), 2016.
  • Ratanamahatana et al. (2005) Ratanamahatana, C. A., Keogh, E., Bagnall, A. J., and Lonardi, S. A novel bit level time series representation with implications for similarity search and clustering. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005), 2005.
  • Ronan et al. (2017) Ronan, N. J., Elborn, J., and Plant, B. J. Current and emerging comorbidities in cystic fibrosis. Presse Med., 46(6):125–138, June 2017.
  • Rousseeuw (1987) Rousseeuw, P. J. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
  • Rusanov et al. (2016) Rusanov, A., Prado, P. V., and Weng, C. Unsupervised time-series clustering over lab data for automatic identification of uncontrolled diabetes. In Proceedings of the 4th IEEE International Conference on Healthcare Informatics (ICHI), 2016.
  • Samal et al. (2011) Samal, L., Wright, A., Wong, B., Linder, J., and Bates, D. Leveraging electronic health records to support chronic disease management: the need for temporal data views. Informatics in Primary Care, 19(2):65–74, 2011.
  • van den Oord et al. (2017) van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
  • Vinh et al. (2010) Vinh, N. X., Epps, J., and Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(1):2837–2854, October 2010.
  • Wami et al. (2013) Wami, W. M., Buntinx, F., Bartholomeeusen, S., Goderis, G., Mathieu, C., and Aerts, M. Influence of chronic comorbidity and medication on the efficacy of treatment in patients with diabetes in general practice. The British Journal of General Practice, 63(609):267–273, March 2013.
  • Xie et al. (2017) Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2017.
  • Yang et al. (2017) Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.
  • Yoon et al. (2017) Yoon, J., Davtyan, C., and van der Schaar, M. Discovery and clinical decision support for personalized healthcare. IEEE J Biomed Health Inform., 21(4):1133–1145, 2017.
  • Zhang et al. (2019) Zhang, X., Chou, J., Liang, J., Xiao, C., Zhao, Y., Sarva, H., Henchcliffe, C., and Wang, F. Data-driven subtyping of Parkinson’s disease using longitudinal clinical records: A cohort study. Scientific Reports, 9(797):1–12, January 2019.

Appendix A AC-TPC for Regression and Binary Classification Tasks

As the task changes, estimating the label distribution and calculating the KL divergence in (1) of the manuscript must be redefined accordingly. For a regression task, i.e., $\mathcal{Y} = \mathbb{R}$, we modify the predictor to output a real value and replace $\ell_1$ by the mean squared error; minimizing the mean squared error is equivalent to minimizing the KL divergence between $p(y \,|\, \mathbf{x}_{1:t})$ and $p(\bar{y} \,|\, c_t)$ when we assume these probability densities follow Gaussian distributions with the same variance. For the $M$-dimensional binary classification task, i.e., $\mathcal{Y} = \{0, 1\}^M$, we modify the predictor to output $M$ sigmoid probabilities and replace $\ell_1$ by the sum of element-wise binary cross-entropy terms, which is required to minimize the KL divergence. Here, $y_{t,m}$ and $\bar{y}_{t,m}$ indicate the $m$-th elements of $y_t$ and $\bar{y}_t$, respectively. The basic assumption here is that the distribution of each binary label is independent given the input sequence.
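For instance, a minimal sketch of the regression variant (the layer sizes and names are our assumptions):

```python
import torch
import torch.nn as nn

# Regression head: a single real-valued output instead of a softmax over classes.
predictor = nn.Sequential(nn.Linear(50, 50), nn.ReLU(), nn.Linear(50, 1))

def regression_loss(y_true, y_bar):
    # Mean squared error; under the equal-variance Gaussian assumption above,
    # minimizing it matches minimizing the KL divergence between the two densities.
    return ((y_true - y_bar) ** 2).mean()
```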

Appendix B Detailed Derivation of (5)

To derive the gradient of the predictive clustering loss in (5) of the manuscript with respect to the actor parameters, we utilize ideas from actor-critic models (Konda & Tsitsiklis, 2000):

(S.1)

where the second equality comes from the following derivation of the former term:

Appendix C Pseudo-Code of AC-TPC

As illustrated in Section 3.2, AC-TPC is trained in an iterative fashion. We provide the pseudo-code for optimizing our model in Algorithm 1 and that for initializing the parameters in Algorithm 2.

  Input: Dataset , number of clusters , coefficients ,            learning rate , mini-batch size , and update step
  Output: AC-TPC parameters and the embedding dictionary
  Initialize parameters and the embedding dictionary via Algorithm 2
  
  repeat
      Optimize the Encoder, Selector, and Predictor
      for  do
          Sample a mini-batch of data samples:
          for  do
              Calculate the assignment probability:
              Draw the cluster assignment:
              Calculate the label distributions: and
          end for
          Update the encoder and selector :
          Update the predictor :
      end for
      
      Optimize the Cluster Centroids
      for  do
          Sample a mini-batch of data samples:
          for  do
              Calculate the assignment probability:
              Draw the cluster assignment:
              Calculate the label distributions:
          end for
          for  do
              Update the embeddings :
          end for
          Update the embedding dictionary:
      end for
  until convergence
Algorithm 1 Pseudo-code for Optimizing AC-TPC
  Input: Dataset , number of clusters , learning rate , mini-batch size
  Output: AC-TPC parameters and the embedding dictionary
  Initialize parameters via Xavier Initializer
  
  Pre-train the Encoder and Predictor
  repeat
      Sample a mini-batch of data samples:
      for  do
          Calculate the label distributions:
      end for
  until convergence
  
  Initialize the Cluster Centroids
  Calculate the embedding dictionary and initial cluster assignments
  Pre-train the Selector
  repeat
      Sample a mini-batch of data samples:
      for  do
          Calculate the assignment probability:
      end for
      Update the selector :
  until convergence
Algorithm 2 Pseudo-code for pre-training AC-TPC
STATIC COVARIATES Type Mean Type Mean
Demographic Gender Bin. 0.55
Genetic Class I Mutation Bin. 0.05 Class VI Mutation Bin. 0.86
Class II Mutation Bin. 0.87 DF508 Mutation Bin. 0.87
Class III Mutation Bin. 0.89 G551D Mutation Bin. 0.06
Class IV Mutation Bin. 0.05 Homozygous Bin. 0.58
Class V Mutation Bin. 0.04 Heterozygous Bin 0.42
TIME-VARYING COVARIATES Type Mean Min / Max Type Mean Min / Max
Demographic Age Cont. 30.4 18.0 / 86.0 Height Cont. 168.0 129.0 / 198.6
Weight Cont. 64.1 24.0 / 173.3 BMI Cont. 22.6 10.9 / 30.0
Smoking Status Bin. 0.1
Lung Func. Scores FEV Cont. 2.3 0.2 / 6.3 Best FEV Cont. 2.5 0.3 / 8.0
FEV% Pred. Cont. 65.1 9.0 / 197.6 Best FEV% Pred. Cont. 71.2 7.5 / 164.3
Hospitalization IV ABX Days Hosp. Cont. 12.3 0 / 431 Non-IV Hosp. Adm. Cont. 1.2 0 / 203
IV ABX Days Home Cont. 11.9 0 / 441
Lung Infections B. Cepacia Bin. 0.05 P. Aeruginosa Bin. 0.59
H. Influenza Bin. 0.05 K. Pneumoniae Bin. 0.00
E. Coli Bin. 0.01 ALCA Bin. 0.03
Aspergillus Bin. 0.14 NTM Bin. 0.03
Gram-Negative Bin. 0.01 Xanthomonas Bin. 0.05
S. Aureus Bin. 0.30
Comorbidities Liver Disease Bin. 0.16 Depression Bin. 0.07
Asthma Bin. 0.15 Hemoptysis Bin. 0.01
ABPA Bin. 0.12 Pancreatitis Bin. 0.01
Hypertension Bin. 0.04 Hearing Loss Bin. 0.03
Diabetes Bin. 0.28 Gall bladder Bin. 0.01
Arthropathy Bin. 0.09 Colonic stricture Bin. 0.00
Bone fracture Bin. 0.01 Intest. Obstruction Bin. 0.08
Osteoporosis Bin. 0.09 GI bleed – no var. Bin. 0.00
Osteopenia Bin. 0.21 GI bleed – var. Bin. 0.00
Cancer Bin. 0.00 Liver Enzymes Bin. 0.16
Cirrhosis Bin. 0.03 Kidney Stones Bin. 0.02
Treatments Dornase Alpha Bin. 0.56 Inhaled B. BAAC Bin. 0.03
Anti-fungals Bin. 0.07 Inhaled B. LAAC Bin. 0.08
HyperSaline Bin. 0.23 Inhaled B. SAAC Bin. 0.05
HypertonicSaline Bin. 0.01 Inhaled B. LABA Bin. 0.11
Tobi Solution Bin. 0.20 Inhaled B. Dilators Bin. 0.57
Cortico Combo Bin. 0.41 Cortico Inhaled Bin. 0.15
Non-IV Ventilation Bin. 0.05 Oral B. Theoph. Bin. 0.04
Acetylcysteine Bin. 0.02 Oral B. BA Bin. 0.03
Aminoglycoside Bin. 0.03 Oral Hypo. Agents Bin. 0.01
iBuprofen Bin. 0.00 Chronic Oral ABX Bin. 0.53
Drug Dornase Bin. 0.02 Cortico Oral Bin. 0.14
HDI Buprofen Bin. 0.00 Oxygen Therapy Bin. 0.11
Tobramycin Bin. 0.03 O Exacerbation Bin. 0.03
Leukotriene Bin. 0.07 O Nocturnal Bin. 0.03
Colistin Bin. 0.03 O Continuous Bin. 0.03
Diabetes Insulin Bin. 0.01 O Pro re nata Bin. 0.01
Macrolide ABX Bin. 0.02
ABX: antibiotics
Table S.1: Summary and description of the UKCF dataset.
STATIC COVARIATES Type Mean Min/Max (Mode) Type Mean Min/Max (Mode)
Demographic Race Cat. 0.93 White Ethnicity Cat. 0.97 No Hisp/Latino
Education Cat. 0.23 C16 Marital Status Cat. 0.75 Married
Genetic APOE Cont. 0.44 0/2
TIME-VARYING COVARIATES Type Mean Min / Max Type Mean Min / Max
Demographic Age Cont. 73.6 55/92
Biomarker Entorhinal Cont. 3.6E+3 1.0E+3 / 6.7E+3 Mid Temp Cont. 2.0E+4 8.9E+3 / 3.2E+4
Fusiform Cont. 1.8E+5 9.0E+4 / 2.9E+5 Ventricles Cont. 4.1E+4 5.7E+3 / 1.6E+5
Hippocampus Cont. 6.9E+3 2.8E+3 / 1.1E+4 Whole Brain Cont. 1.0E+6 6.5E+5 / 1.5E+6
Intracranial Cont. 1.5E+6 2.9E+2 / 2.1E+6
Cognitive ADAS-11 Cont. 8.58 0/70 ADAS-13 Cont. 13.61 0/85
CDR Sum of Boxes Cont. 1.21 0/17 Mini Mental State Cont. 27.84 2/30
RAVLT Forgetting Cont. 4.19 -12/15 RAVLT Immediate Cont. 38.25 0/75
RAVLT Learning Cont. 4.65 -5/14 RAVLT Percent Cont. 51.70 -500/100
Table S.2: Summary and description of the ADNI dataset.

Appendix D Details of the Datasets

D.1 UKCF Dataset

UK Cystic Fibrosis registry (UKCF; https://www.cysticfibrosis.org.uk/the-work-we-do/uk-cf-registry) records annual follow-ups for 5,171 adult patients (aged 18 years or older) over the period from 2008 to 2015, with a total of 25,012 hospital visits. Each patient is associated with 89 variables (i.e., 11 static and 78 time-varying features), including information on demographics and genetic mutations, bacterial infections, lung function scores, therapeutic managements, and diagnoses of comorbidities. The detailed statistics are given in Table S.1.

D.2 ADNI Dataset

The Alzheimer’s Disease Neuroimaging Initiative (ADNI; https://adni.loni.usc.edu) study consists of 1,346 patients with a total of 11,651 hospital visits, and tracks disease progression via follow-up observations at 6-month intervals. Each patient is associated with 21 variables (i.e., 5 static and 16 time-varying features), including information on demographics, biomarkers of brain function, and cognitive test results. The three diagnostic groups were normal brain functioning, mild cognitive impairment, and Alzheimer’s disease. The detailed statistics are given in Table S.2.

(a) DCN-S2S
(b) DCN-E2P, KM-E2P
(c) SOM-VAE
(d) SOM-VAE-P
Figure S.1: The block diagrams of the tested benchmarks.
Methods Handling Time-Series Clustering Method Similarity Measure Label Provided Label Associated
KM-DTW DTW K-means DTW N N
KM-E2P (Z) RNN K-means Euclidean in Z Y Y (indirect)
KM-E2P (Y) RNN K-means Euclidean in Y Y Y (direct)
DCN-S2S RNN K-means Euclidean in Z N N
DCN-E2P RNN K-means Euclidean in Z Y Y (indirect)
SOM-VAE Markov model embedding mapping reconstruction loss N N
SOM-VAE-P Markov model embedding mapping prediction loss Y Y (direct)
Proposed RNN embedding mapping KL divergence Y Y (direct)
Table S.3: Comparison table of benchmarks.

Appendix E Details of the Benchmarks

We compared AC-TPC with clustering methods ranging from conventional approaches based on $K$-means to state-of-the-art approaches based on deep neural networks. The details of how we implemented the benchmarks are as follows:


  • Dynamic time warping followed by $K$-means (https://github.com/rtavenar/tslearn): Dynamic time warping (DTW) is utilized to quantify the pairwise distance between two variable-length sequences and, then, $K$-means is applied (denoted as KM-DTW).

  • $K$-means with deep neural networks: To handle variable-length time-series data, we utilized an encoder-predictor network as depicted in Figure S.1(b) and trained the network based on (6) for dimensionality reduction; this provides fixed-length and low-dimensional representations for time-series. Then, we applied $K$-means on the latent encodings (denoted as KM-E2P (Z)) and on the predicted label distributions (denoted as KM-E2P (Y)), respectively. We implemented the encoder and predictor of KM-E2P with the same network architectures as those of our model: the encoder is a single-layer LSTM with 50 nodes and the predictor is a two-layer fully-connected network with 50 nodes in each layer.

  • Extensions of DCN (Yang et al., 2017) (https://github.com/boyangumn/DCN): Since the DCN is designed for static data, we utilized a sequence-to-sequence model in Figure S.1(a) for the encoder-decoder network as an extension to incorporate time-series data (denoted as DCN-S2S) and trained the network based on the reconstruction loss (using the reconstructed input sequence). For implementing DCN-S2S, we used a single-layer LSTM with 50 nodes for both the encoder and the decoder, and we augmented the decoder with a fully-connected layer with 50 nodes to reconstruct the original input sequence from its latent representation.

    In addition, since predictive clustering is associated with the label distribution, we compared a DCN whose encoder-decoder structure is replaced with our encoder-predictor network in Figure S.1(b) (denoted as DCN-E2P), so as to focus the dimensionality reduction – and, thus, the latent encodings on which clustering is performed – on the information for predicting the label distribution. We implemented the encoder and predictor of DCN-E2P with the same network architectures as those of our model, as described in Section 5.

  • SOM-VAE (Fortuin et al., 2019) (https://github.com/ratschlab/SOM-VAE): We compare with SOM-VAE – though this method is oriented towards visualization of the input data via the SOM – since it naturally clusters time-series data under a Markov assumption (denoted as SOM-VAE). We replaced the original CNN architecture of the encoder and the decoder with three-layer fully-connected networks with 50 nodes in each layer, respectively. The network architecture is depicted in Figure S.1(c), where the two outputs indicate the inputs reconstructed from the encoding and from the embedding at each time stamp, respectively.

    In addition, we compare with a variation of SOM-VAE in which the decoder is replaced with the predictor, to encourage the latent encoding to capture information for predicting the label distribution (denoted as SOM-VAE-P). For the implementation, we replaced the decoder of SOM-VAE with our predictor, which is a two-layer fully-connected network with 50 nodes in each layer, to predict the label distribution as illustrated in Figure S.1(d). Here, the two outputs indicate the labels predicted from the encoding and from the embedding at each time stamp, respectively.

    For both cases, we used the default values for the balancing coefficients of SOM-VAE and set the dimension of the SOM to be equal to .

We compared and summarized major components of the benchmarks in Table S.3.

Appendix F Additional Experiments

F.1 Contributions of the Auxiliary Loss Functions

As described in Section 3.1, we introduced two auxiliary loss functions – the sample-wise entropy of cluster assignment (3) and the embedding separation loss (4) – to avoid trivial solutions that may arise in identifying the predictive clusters. To analyze the contribution of each auxiliary loss function, we report the average number of activated clusters, the clustering performance, and the prediction performance on the UKCF dataset with the comorbidities described in Section 5.4. Throughout the experiment, we set the number of clusters $K$ larger than the number of true classes to find the contribution of these loss functions to the number of activated clusters.

Coefficients ($\alpha$ / $\beta$)   Activated No.   Clustering Performance (Purity / NMI / ARI)   Prognostic Value (AUROC / AUPRC)
0.0 / 0.0   16    0.573±0.01 / 0.006±0.00 / 0.000±0.00   0.500±0.00 / 0.169±0.00
0.0 / 1.0   16    0.573±0.01 / 0.006±0.00 / 0.000±0.00   0.500±0.00 / 0.169±0.00
3.0 / 0.0   8.4   0.795±0.01 / 0.431±0.01 / 0.569±0.01   0.840±0.01 / 0.583±0.02
3.0 / 1.0   8     0.808±0.01 / 0.468±0.01 / 0.606±0.01   0.852±0.00 / 0.608±0.01
Table S.4: Performance comparison with varying the balancing coefficients for the UKCF dataset.

As shown in Table S.4, both auxiliary loss functions make important contributions to improving the quality of predictive clustering. More specifically, the sample-wise entropy encourages the selector to choose one dominant cluster. Thus, as the rows of Table S.4 with a zero entropy coefficient show, without the sample-wise entropy the selector assigns equal probability to all clusters, which results in a trivial solution. We also observed that augmenting the embedding separation loss (4) leads AC-TPC to identify a smaller number of clusters, owing to the well-separated embeddings.
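For clarity, the following is a minimal sketch of the two auxiliary terms, assuming the selector outputs assignment probabilities pi over the clusters and the cluster embeddings form a (K, D) matrix. The exact functional form of the separation term in our implementation may differ (here it is simply the negative mean pairwise distance), so this should be read as an illustration of the two losses rather than the reference implementation.

import torch

def sample_wise_entropy(pi, eps=1e-8):
    """Sample-wise entropy of the cluster-assignment probabilities
    pi (batch, K); minimizing it pushes each sample towards one
    dominant cluster and rules out the uniform (trivial) assignment."""
    return -(pi * torch.log(pi + eps)).sum(dim=1).mean()

def embedding_separation(embeddings):
    """Encourages the cluster embeddings (K, D) to stay far apart by
    penalizing small pairwise distances (negative mean off-diagonal
    distance), which discourages redundant, overlapping clusters."""
    dist = torch.cdist(embeddings, embeddings, p=2)   # (K, K) pairwise distances
    K = embeddings.size(0)
    off_diag = dist[~torch.eye(K, dtype=torch.bool, device=dist.device)]
    return -off_diag.mean()

def auxiliary_loss(pi, embeddings, coef_entropy=3.0, coef_separation=1.0):
    """Weighted sum of the two auxiliary terms; the coefficients play the
    role of the balancing coefficients swept in Table S.4."""
    return (coef_entropy * sample_wise_entropy(pi)
            + coef_separation * embedding_separation(embeddings))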

F.2 Additional Results on Targeting Multiple Future Outcomes

In the experiment in Section 5.5, we identified subgroups of patients associated with the next-year development of different comorbidities in the UKCF dataset. In Table S.5, we report the identified clusters together with the full list of comorbidities developed in the year following the latest observation and their corresponding frequencies, calculated in a cluster-specific fashion based on the true labels.

As shown in the table, the identified clusters display very different label distributions; that is, both the combinations of comorbidities and their frequencies differ substantially across clusters. For example, patients in Cluster 1 were at low risk of developing any comorbidity in the next year, whereas 85% of patients in Cluster 0 developed diabetes in the next year.

Clusters Comorbidities and Frequencies
Cluster 0 Diabetes (0.85) Liver Enzymes (0.21) Arthropathy (0.14) Depression (0.10)
Hypertension (0.08) Osteopenia (0.07) Intest. Obstruction (0.07) Cirrhosis (0.04)
ABPA (0.04) Liver Disease (0.04) Osteoporosis (0.03) Hearing Loss (0.03)
Asthma (0.02) Kidney Stones (0.01) Bone fracture (0.01) Hemoptysis (0.01)
Pancreatitis (0.01) Cancer (0.00) Gall bladder (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 1 Liver Enzymes (0.09) Arthropathy (0.08) Depression (0.07) Intest. Obstruction (0.06)
Diabetes (0.06) Osteopenia (0.05) ABPA (0.04) Asthma (0.03)
Liver Disease (0.03) Hearing Loss (0.03) Osteoporosis (0.02) Pancreatitis (0.02)
Kidney Stones (0.02) Hypertension (0.01) Cirrhosis (0.01) Gall bladder (0.01)
Cancer (0.01) Hemoptysis (0.00) Bone fracture (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 2 ABPA (0.77) Osteopenia (0.21) Intest. Obstruction (0.11) Hearing Loss (0.10)
Liver Enzymes (0.07) Diabetes (0.06) Depression (0.05) Pancreatitis (0.05)
Liver Disease (0.04) Arthropathy (0.04) Asthma (0.03) Bone fracture (0.02)
Osteoporosis (0.02) Hypertension (0.01) Cancer (0.01) Cirrhosis (0.01)
Kidney Stones (0.01) Gall bladder (0.01) Hemoptysis (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 3 Asthma (0.89) Liver Disease (0.87) Diabetes (0.29) Osteopenia (0.28)
Liver Enzymes (0.24) ABPA (0.15) Osteoporosis (0.11) Hearing Loss (0.06)
Arthropathy (0.05) Intest. Obstruction (0.05) Depression (0.04) Hypertension (0.03)
Cirrhosis (0.02) Kidney Stones (0.02) Pancreatitis (0.02) Gall bladder (0.02)
Cancer (0.01) Bone fracture (0.00) Hemoptysis (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 4 Osteoporosis (0.76) Diabetes (0.43) Arthropathy (0.20) Liver Enzymes (0.18)
Osteopenia (0.15) Depression (0.13) Intest. Obstruction (0.11) ABPA (0.11)
Hearing Loss (0.09) Liver Disease (0.08) Hypertension (0.07) Cirrhosis (0.07)
Kidney Stones (0.03) Asthma (0.02) Hemoptysis (0.02) Bone fracture (0.02)
Gall bladder (0.01) Pancreatitis (0.01) Cancer (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 5 Asthma (0.88) Diabetes (0.81) Osteopenia (0.28) ABPA (0.24)
Liver Enzymes (0.22) Depression (0.15) Osteoporosis (0.14) Intest. Obstruction (0.12)
Hypertension (0.10) Cirrhosis (0.10) Liver Disease (0.09) Arthropathy (0.08)
Bone fracture (0.01) Hemoptysis (0.01) Pancreatitis (0.01) Hearing Loss (0.01)
Cancer (0.01) Kidney Stones (0.01) GI bleed – var. (0.01) Gall bladder (0.00)
Colonic stricture (0.00) GI bleed – no var. (0.00)
Cluster 6 Liver Disease (0.85) Liver Enzymes (0.37) Osteopenia (0.27) ABPA (0.09)
Arthropathy (0.07) Diabetes (0.06) Intest. Obstruction (0.06) Osteoporosis (0.05)
Depression (0.03) Asthma (0.03) Hearing Loss (0.03) Cirrhosis (0.02)
Hemoptysis (0.02) Hypertension (0.01) Kidney Stones (0.01) Pancreatitis (0.00)
Gall bladder (0.00) Bone fracture (0.00) Cancer (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 7 ABPA (0.83) Diabetes (0.78) Osteopenia (0.25) Osteoporosis (0.24)
Liver Enzymes (0.15) Intest. Obstruction (0.12) Liver Disease (0.09) Hypertension (0.07)
Hearing Loss (0.07) Arthropathy (0.06) Depression (0.04) Cirrhosis (0.02)
Asthma (0.01) Bone fracture (0.01) Kidney Stones (0.01) Hemoptysis (0.01)
Cancer (0.00) Pancreatitis (0.00) Gall bladder (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 8 Diabetes (0.94) Liver Disease (0.83) Liver Enzymes (0.43) Osteopenia (0.30)
Hearing Loss (0.11) Osteoporosis (0.10) Intest. Obstruction (0.09) Cirrhosis (0.08)
Depression (0.08) ABPA (0.07) Arthropathy (0.06) Hypertension (0.05)
Kidney Stones (0.05) Asthma (0.02) Hemoptysis (0.01) Bone fracture (0.01)
Cancer (0.00) Pancreatitis (0.00) Gall bladder (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 9 Asthma (0.89) Osteopenia (0.26) ABPA (0.19) Arthropathy (0.14)
Intest. Obstruction (0.11) Depression (0.08) Osteoporosis (0.08) Diabetes (0.06)
Liver Enzymes (0.06) Hemoptysis (0.03) Hypertension (0.02) Liver Disease (0.02)
Pancreatitis (0.02) Bone fracture (0.01) Cancer (0.01) Cirrhosis (0.01)
Gall bladder (0.01) Hearing Loss (0.01) Kidney Stones (0.00) Colonic stricture (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Cluster 10 Osteopenia (0.82) Diabetes (0.81) Arthropathy (0.23) Depression (0.19)
Liver Enzymes (0.18) Hypertension (0.16) Hearing Loss (0.10) Liver Disease (0.10)
Osteoporosis (0.10) Intest. Obstruction (0.09) ABPA (0.09) Kidney Stones (0.07)
Cirrhosis (0.05) Asthma (0.01) Cancer (0.00) GI bleed – var. (0.00)
Bone fracture (0.00) Hemoptysis (0.00) Pancreatitis (0.00) Gall bladder (0.00)
Colonic stricture (0.00) GI bleed – no var. (0.00)
Cluster 11 Osteopenia (0.77) Liver Enzymes (0.18) Arthropathy (0.12) Depression (0.09)
Hypertension (0.06) Diabetes (0.06) Hearing Loss (0.06) ABPA (0.05)
Liver Disease (0.05) Osteoporosis (0.04) Intest. Obstruction (0.04) Cirrhosis (0.02)
Asthma (0.02) Pancreatitis (0.02) Bone fracture (0.01) Cancer (0.01)
Kidney Stones (0.00) Gall bladder (0.00) Colonic stricture (0.00) Hemoptysis (0.00)
GI bleed – no var. (0.00) GI bleed – var. (0.00)
Table S.5: The comorbidities developed in the next year for each identified cluster. The values in parentheses indicate the corresponding frequencies.

F.3 How Do the Temporal Phenotypes Change over Time?

In this subsection, we present run-time examples of how AC-TPC flexibly updates the cluster assignments over time with respect to the future development of comorbidities in the next year; a minimal sketch of this run-time assignment procedure is given after the figure. Figure S.2 illustrates three representative patients:


  • Patient A had diabetes from the beginning of the study and developed asthma as an additional comorbidity at . Accordingly, AC-TPC changed the temporal phenotype assigned to this patient from Cluster 0, which consists of patients who are very likely to develop diabetes but very unlikely to develop asthma in the next year, to Cluster 5, which consists of patients who are likely to develop both diabetes and asthma in the next year, at .

  • Patient B had ABPA from the beginning of the study and developed diabetes at . Similarly, AC-TPC changed the temporal phenotype assigned to this patient from Cluster 2, which consists of patients who are likely to develop ABPA but not diabetes in the next year, to Cluster 7, which consists of patients who are likely to develop both ABPA and diabetes in the next year, at .

  • Patient C had no comorbidity at the beginning of the study and developed asthma and liver disease as additional comorbidities at and , respectively. AC-TPC changed the temporal phenotype assigned to this patient from Cluster 1 to Cluster 9 at and then to Cluster 3 at . These changes in the temporal phenotype were consistent with the actual development of asthma and liver disease, considering the distributions of comorbidity development in the next year – that is, Cluster 1 consists of patients who are unlikely to develop any comorbidity in the next year, Cluster 9 consists of patients who are likely to develop asthma but not liver disease, and Cluster 3 consists of patients who are likely to develop both asthma and liver disease in the next year.

Figure S.2: An illustration of run-time examples of AC-TPC on three representative patients.
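The run-time behaviour illustrated above can be summarized by a short sketch: each time a new observation arrives, the history observed so far is re-encoded and the selector re-assigns the patient to the most probable cluster. The function below is illustrative only – `encoder` and `selector` stand in for the trained networks, and re-encoding the full prefix at every step is a simplification of stepping the recurrent state forward.

import torch

def assign_phenotypes_over_time(x_seq, encoder, selector):
    """Sketch of run-time phenotype assignment: x_seq is (batch, T, x_dim);
    at each time step the history up to t is encoded and the selector's
    most probable cluster becomes the current temporal phenotype."""
    assignments = []
    for t in range(1, x_seq.size(1) + 1):
        z_t = encoder(x_seq[:, :t, :])               # latent encoding of the history
        pi_t = torch.softmax(selector(z_t), dim=-1)  # cluster-assignment probabilities
        assignments.append(pi_t.argmax(dim=-1))      # hard assignment at time t
    return torch.stack(assignments, dim=1)           # (batch, T) cluster indices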

References

  • Fortuin et al. (2019) Fortuin, V., Hüser, M., Locatello, F., Strathmann, H., and Rätsch, G. SOM-VAE: Interpretable discrete representation learning on time series. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.
  • Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proceedings of the 13th Conference on Neural Information Processing Systems (NIPS 2000), 2000.
  • Yang et al. (2017) Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.