1 Introduction
Chronic diseases – such as cystic fibrosis and dementia – are heterogeneous in nature, with widely differing outcomes even in narrow patient subgroups. Disease progression manifests through a broad spectrum of clinical factors, collected as a sequence of measurements in electronic health records, which gives rise to complex progression patterns among patients (Samal et al., 2011; Yoon et al., 2017). For example, cystic fibrosis evolves slowly, allowing for the development of comorbidities and bacterial infections and creating distinct responses to therapeutic interventions, which in turn leads to substantial differences in survival and quality of life (Ramos et al., 2017; Lee et al., 2019). Identifying patient subgroups with similar progression patterns can be advantageous for understanding such heterogeneous diseases. This allows clinicians to anticipate patients’ prognoses by comparing them to “similar” patients and to design treatment guidelines tailored to homogeneous subgroups (Zhang et al., 2019).
Temporal clustering has recently been used as a data-driven framework to partition patients with time-series observations into subgroups. Recent research has typically focused on either finding fixed-length and low-dimensional representations (Zhang et al., 2019; Rusanov et al., 2016) or on modifying the similarity measure (Giannoula et al., 2018; Luong & Chandola, 2017), both in an attempt to apply existing clustering algorithms to time-series observations. However, clusters identified by these approaches are purely unsupervised – they do not account for patients’ observed outcomes (e.g., adverse events, the onset of comorbidities, etc.) – which leads to heterogeneous clusters if the clinical presentation of the disease differs even for patients with the same outcomes. Thus, a common prognosis in each cluster remains unknown, which can obscure the understanding of the underlying disease progression (Boudier et al., 2019; Wami et al., 2013). To overcome this limitation, we focus on predictive clustering (Blockeel et al., 2017) to combine predictions of future outcomes with clustering. More specifically, we aim at finding cluster assignments and centroids by learning discrete representations of time-series that best describe the future outcome distribution. By doing so, patients in the same cluster share similar future outcomes, which provides prognostic value. Figure 1 gives a pictorial depiction of the clustering procedure.
In this paper, we propose an actor-critic approach for temporal predictive clustering, which we call AC-TPC (source code available at https://github.com/chl8856/AC_TPC). Our model consists of three networks – an encoder, a selector, and a predictor – and a set of centroid candidates. The key insight, here, is that we model temporal predictive clustering as learning discrete representations of the input time-series that best describe the future outcome distribution. More specifically, the encoder maps an input time-series into a continuous latent encoding; the selector assigns a cluster (i.e., maps to a discrete representation) to which the input belongs by taking the latent encoding as an input; and the predictor estimates the future outcome distribution conditioned on either the encoding or the centroid of the selected cluster (i.e., the selected discrete representation). The following three contributions allow our model to achieve our goal. First, to encourage homogeneous future outcomes in each cluster, we define a clustering objective based on the Kullback-Leibler (KL) divergence between the predictor’s output given the time-series and that given the assigned centroid. Second, we transform the combinatorial problem of identifying clusters into iteratively solving two subproblems: optimization of the cluster assignments and optimization of the centroids. Finally, we allow “back-propagation” through the sampling process of the selector by adopting actor-critic training (Konda & Tsitsiklis, 2000).

Throughout the experiments, we show significant performance improvements over state-of-the-art clustering methods on two real-world medical datasets. To demonstrate the practical significance of our model, we consider a more realistic scenario where the future outcomes of interest are high-dimensional – that is, the development of multiple comorbidities in the next year – and interpreting all possible combinations is intractable. Our experiments show that our model can identify meaningful clusters that can be translated into actionable information for clinical decision-making.
2 Problem Formulation
Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be random variables for an input feature and an output label (i.e., one or a combination of future outcome(s) of interest) with a joint distribution $p(x, y)$ (and marginal distributions $p(x)$ and $p(y)$), where $\mathcal{X}$ is the feature space and $\mathcal{Y}$ is the label space. Here, we focus our description on $C$-class classification tasks, i.e., $\mathcal{Y} = \{1, \dots, C\}$ (in the Supplementary Material, we discuss simple modifications for regression and multi-dimensional binary classification tasks). We are given a time-series dataset $\mathcal{D} = \{(\mathbf{x}^n, \mathbf{y}^n)\}_{n=1}^{N}$ comprising sequences of realizations (i.e., observations) of the pair $(x, y)$ for $N$ patients. Here, $(\mathbf{x}^n, \mathbf{y}^n) = \{(x^n_t, y^n_t)\}_{t=1}^{T^n}$ is a sequence of observation pairs that correspond to patient $n$, and $t$ denotes the time stamp at which the observations are made. From this point forward, we omit the dependency on $n$ when it is clear in the context and denote a subsequence as $\mathbf{x}_{1:t} = (x_1, \dots, x_t)$.

Our aim is to identify a set of $K$ predictive clusters, $\mathcal{C} = \{\mathcal{C}(1), \dots, \mathcal{C}(K)\}$, for time-series data. Each cluster consists of homogeneous data samples, which can be represented by its centroid, based on a certain similarity measure. There are two main distinctions from the conventional notion of clustering. First, we treat subsequences $\mathbf{x}_{1:t}$ of each time-series as data samples and focus on partitioning them into $\mathcal{C}$. Hence, we define a cluster as $\mathcal{C}(k) = \{\mathbf{x}_{1:t} : s_t = k\}$ for $k = 1, \dots, K$, where $s_t$ is the cluster assignment for a given $\mathbf{x}_{1:t}$. This is to flexibly update the cluster assignment (in real-time) to which a patient belongs as new observations are accrued over time. Second, we define the similarity measure with respect to the label distribution and associate it with clusters to provide prognostic value. More specifically, we want the distribution of the output label for subsequences in each cluster to be homogeneous and, thus, well-represented by the centroid of that cluster.
Let $s_t$ be a random variable for the cluster assignment – which depends on a given subsequence $\mathbf{x}_{1:t}$ – so that the output given cluster $s_t$ is itself a random variable. Then, such a property of predictive clustering can be achieved by minimizing the KL divergence $KL\big(p(y \mid \mathbf{x}_{1:t}) \,\|\, p(y \mid s_t)\big)$, where $p(y \mid \mathbf{x}_{1:t})$ and $p(y \mid s_t)$ are the label distributions conditioned on a subsequence $\mathbf{x}_{1:t}$ and a cluster assignment $s_t$, respectively. Note that the KL divergence achieves its minimum when the two distributions are equivalent.
Finally, we establish our goal as identifying a set of predictive clusters that optimizes the following objective:

(1)  $\min \; \mathbb{E}\Big[ KL\big( p(y \mid \mathbf{x}_{1:t}) \,\|\, p(y \mid s_t) \big) \Big]$
Unfortunately, the optimization problem in (1) is highly non-trivial. We need to estimate the objective function in (1) while solving a non-convex combinatorial problem of finding the optimal cluster assignments and cluster centroids.
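As a concrete illustration of the objective in (1), the following toy numpy sketch (a hypothetical illustration under our own naming, not the authors' implementation) computes the average KL divergence between per-sample label distributions and the label distribution represented by each sample's assigned cluster:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical label distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def predictive_clustering_objective(p_y_given_x, assignments, p_y_given_cluster):
    """Average KL between each sample's label distribution p(y|x_{1:t})
    and the distribution p(y|s_t) represented by its assigned cluster."""
    return float(np.mean([
        kl_divergence(p, p_y_given_cluster[k])
        for p, k in zip(p_y_given_x, assignments)
    ]))
```

The objective is zero exactly when every sample's label distribution matches that of its cluster, and grows as cluster-conditional distributions fail to describe their members.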
3 Method: AC-TPC
To effectively estimate the objective function in (1), we introduce three networks – an encoder, a selector, and a predictor – and an embedding dictionary, as illustrated in Figure 2. These components together provide the cluster assignment and the corresponding centroid based on a given sequence of observations and enable us to estimate the probability density $p(y \mid s_t)$. More specifically, we define each component as follows:
- The encoder, $f_\theta$, is an RNN (parameterized by $\theta$) that maps a (sub)sequence of a time-series $\mathbf{x}_{1:t}$ to a latent representation (i.e., encoding) $z_t \in \mathcal{Z}$, where $\mathcal{Z}$ is the latent space.
- The selector, $h_\psi$, is a fully-connected network (parameterized by $\psi$) that provides a probabilistic mapping $\pi_t \in \Delta^{K-1}$ to a categorical distribution from which the cluster assignment $s_t$ is sampled.
- The predictor, $g_\phi$, is a fully-connected network (parameterized by $\phi$) that estimates the label distribution given the encoding of a time-series or the centroid of a cluster.
- The embedding dictionary, $\mathcal{E} = \{e_1, \dots, e_K\}$ where $e_k \in \mathcal{Z}$ for $k = 1, \dots, K$, is a set of cluster centroids lying in the latent space, each of which represents the corresponding cluster.

Here, $\Delta^{K-1}$ is a simplex that denotes the probability distribution for a $K$-dimensional categorical (class) variable. At each time stamp $t$, the encoder maps an input (sub)sequence $\mathbf{x}_{1:t}$ into a latent encoding $z_t = f_\theta(\mathbf{x}_{1:t})$. Then, based on the encoding $z_t$, the cluster assignment is drawn from a categorical distribution defined by the selector output, i.e., $s_t \sim \mathrm{Cat}(\pi_t)$ where $\pi_t = h_\psi(z_t)$. Once the assignment is chosen, we allocate the latent encoding $z_t$ to an embedding $e_{s_t}$ in the embedding dictionary $\mathcal{E}$. Since the allocated embedding corresponds to the centroid of the cluster to which $\mathbf{x}_{1:t}$ belongs, we can, finally, estimate the density in (1) as the output of the predictor given the embedding, i.e., $\bar{y}_t = g_\phi(e_{s_t})$.
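The forward pass described above can be sketched in plain numpy. This is a minimal, hypothetical illustration (random weights, a simple tanh recurrence as the encoder, linear-softmax selector and predictor); the dimensions and parameter names are our own assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_Z, K, C = 4, 8, 3, 2          # feature dim, latent dim, #clusters, #classes

# Parameters (random here for illustration; learned in practice)
W_h = rng.normal(size=(D_Z, D_Z)) * 0.1   # encoder recurrence
W_x = rng.normal(size=(D_Z, D_IN)) * 0.1  # encoder input weights
W_sel = rng.normal(size=(K, D_Z)) * 0.1   # selector
W_pred = rng.normal(size=(C, D_Z)) * 0.1  # predictor
E = rng.normal(size=(K, D_Z))             # embedding dictionary (centroids)

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

def encoder(x_seq):
    """RNN-style encoder: fold the subsequence x_{1:t} into an encoding z_t."""
    z = np.zeros(D_Z)
    for x in x_seq:
        z = np.tanh(W_h @ z + W_x @ x)
    return z

def selector(z):
    return softmax(W_sel @ z)   # categorical distribution over K clusters

def predictor(v):
    return softmax(W_pred @ v)  # label distribution given an encoding or centroid

x_seq = rng.normal(size=(5, D_IN))   # one subsequence with 5 time stamps
z = encoder(x_seq)
pi = selector(z)
s = int(rng.choice(K, p=pi))         # sampled cluster assignment
y_bar = predictor(E[s])              # label distribution given the chosen centroid
```

Note that the predictor is applied to the selected centroid `E[s]`, not to `z` itself, which is what ties the estimated label distribution to the discrete cluster representation.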
3.1 Loss Functions
In this subsection, we define loss functions to achieve our objective in (1); the details of how we train our model will be discussed in the following subsection.
Predictive Clustering Loss: Since finding the cluster assignment of a given sequence is probabilistic due to the sampling process, the objective function in (1) must be defined as an expectation over the cluster assignment. Thus, we can cast solving the objective in (1) as minimizing the following loss function:
(2)  $\mathcal{L}_1(\theta, \psi, \phi, \mathcal{E}) = \mathbb{E}\Big[ \sum_{t} \ell_1\big(y_t, \bar{y}_t\big) \Big]$, where $\ell_1(y_t, \bar{y}_t) = -\sum_{c=1}^{C} y_{t,c} \log \bar{y}_{t,c}$ and $\bar{y}_t = g_\phi(e_{s_t})$.

Here, we slightly abuse the notation and denote $y_t$ as the one-hot encoding of the label, and $y_{t,c}$ and $\bar{y}_{t,c}$ indicate the $c$-th components of $y_t$ and $\bar{y}_t$, respectively. It is worth highlighting that minimizing $\mathcal{L}_1$ is equivalent to minimizing the KL divergence in (1), since the former term of the KL divergence is independent of our optimization procedure.

One critical question that may arise is how to avoid trivial solutions in this unsupervised setting of identifying the cluster assignments and the centroids (Yang et al., 2017). For example, all the embeddings in $\mathcal{E}$ may collapse into a single point, or the selector may simply assign equal probability to all the clusters regardless of the input sequence. In both cases, our model will fail to correctly estimate $p(y \mid s_t)$ and, thus, end up finding a trivial solution. To address this issue, we introduce two auxiliary loss functions tailored to this concern. It is worth highlighting that these loss functions are not subject to the sampling process, so their gradients can be simply backpropagated.
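The per-sample loss above is a standard cross-entropy between the one-hot label and the predicted label distribution; a minimal sketch (our own helper, written for illustration):

```python
import numpy as np

def l1_loss(y_onehot, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted distribution:
    l1(y, y_bar) = -sum_c y_c * log(y_bar_c)."""
    return float(-np.sum(np.asarray(y_onehot) * np.log(np.asarray(y_pred) + eps)))
```

The loss is near zero when the predicted distribution puts almost all mass on the true class and grows as the mass moves away from it.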
Sample-Wise Entropy of Cluster Assignment: To encourage sparse cluster assignments such that the selector ultimately selects one dominant cluster for each sequence, we introduce the sample-wise entropy of the cluster assignment, which is given as

(3)  $\mathcal{L}_2(\theta, \psi) = \mathbb{E}\Big[ \sum_{t} -\sum_{k=1}^{K} \pi_{t,k} \log \pi_{t,k} \Big]$

where $\pi_t = h_\psi(f_\theta(\mathbf{x}_{1:t}))$. The sample-wise entropy achieves its minimum when $\pi_t$ becomes a one-hot vector.
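The sample-wise entropy term is the ordinary Shannon entropy of the selector's categorical output; a small sketch for illustration:

```python
import numpy as np

def sample_wise_entropy(pi, eps=1e-12):
    """Shannon entropy -sum_k pi_k * log(pi_k): zero for a one-hot
    assignment, maximal for a uniform assignment over clusters."""
    pi = np.asarray(pi, float)
    return float(-np.sum(pi * np.log(pi + eps)))
```

Minimizing this term therefore pushes the selector toward confident, nearly one-hot cluster assignments.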
Embedding Separation Loss: To prevent the embeddings in $\mathcal{E}$ from collapsing into a single point, we define a loss function that encourages the embeddings to represent different label distributions, i.e., $g_\phi(e_k) \neq g_\phi(e_{k'})$ for $k \neq k'$, from each other:

(4)  $\mathcal{L}_3(\mathcal{E}) = -\sum_{k=1}^{K} \sum_{k' \neq k} \ell_1\big(g_\phi(e_k), g_\phi(e_{k'})\big)$

where $\ell_1$ is reused to quantify the distance between label distributions conditioned on each cluster. We minimize (4) when updating the embedding vectors $\mathcal{E}$.
(5)  $\nabla_{\theta,\psi}\, \mathcal{L}_1(\theta, \psi) = \mathbb{E}\Big[ \sum_{t} \ell_1(y_t, \bar{y}_t)\, \nabla_{\theta,\psi} \log \pi_{t, s_t} \Big]$
3.2 Optimization
The optimization problem in (1) is a non-convex combinatorial problem because it comprises not only minimizing the KL divergence but also finding the optimal cluster assignments and centroids. Hence, we propose an optimization procedure that iteratively solves two subproblems: i) optimizing the three networks – the encoder, selector, and predictor – while fixing the embedding dictionary and ii) optimizing the embedding dictionary while fixing the three networks. Pseudo-code of AC-TPC can be found in the Supplementary Material.
3.2.1 Optimizing the Three Networks
Finding predictive clusters incorporates a sampling process which is non-differentiable. Thus, to enable “back-propagation”, we utilize the training of actor-critic models (Konda & Tsitsiklis, 2000). More specifically, we view the combination of the encoder ($f_\theta$) and the selector ($h_\psi$) as the “actor”, parameterized by $(\theta, \psi)$, and the predictor ($g_\phi$) as the “critic”. The critic takes as input the output of the actor (i.e., the cluster assignment) and estimates its value based on the sample-wise predictive clustering loss (i.e., $\ell_1$) given the chosen cluster. This, in turn, drives the actor to change the distribution over clusters so as to minimize that loss. Thus, it is important for the critic to perform well on the updated output of the actor, and for the actor to perform well on the updated loss estimation. As such, the parameters of the actor and the critic are updated iteratively.
With the embedding dictionary fixed (thus, we omit the dependency on $\mathcal{E}$), we train the actor, i.e., the encoder and the selector, by minimizing a combination of the predictive clustering loss and the entropy of cluster assignments, $\mathcal{L}_a(\theta, \psi) = \mathcal{L}_1(\theta, \psi) + \alpha\, \mathcal{L}_2(\theta, \psi)$, where $\alpha \geq 0$ is a coefficient chosen to balance the two losses. To derive the gradient of this loss with respect to $(\theta, \psi)$, we utilize ideas from actor-critic models (Konda & Tsitsiklis, 2000), leading to the gradient in (5); please refer to the Supplementary Material for the detailed derivation. Note that since no sampling process is involved in $\mathcal{L}_2$, its gradient $\nabla_{\theta,\psi} \mathcal{L}_2$ can be simply derived.
Iteratively with training the actor, we train the critic, i.e., the predictor, by minimizing the predictive clustering loss $\mathcal{L}_1(\phi)$, whose gradient with respect to $\phi$ can be given as $\nabla_\phi \mathcal{L}_1(\phi) = \mathbb{E}\big[\sum_t \nabla_\phi\, \ell_1(y_t, \bar{y}_t)\big]$. Note that since the critic is independent of the sampling process, the gradient can be simply backpropagated.
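The score-function ("REINFORCE"-style) gradient that actor-critic training uses to get around the non-differentiable sampling step can be shown on a toy problem. The following is a hypothetical numpy sketch (linear selector, a fixed latent encoding, hand-set per-cluster losses standing in for the critic's estimates), not the paper's training loop:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a):
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

# Toy setup: 2 clusters; cluster 0 incurs a low loss, cluster 1 a high loss.
W = np.zeros((2, 3))                  # selector weights (the "actor")
z = np.array([1.0, -0.5, 0.2])        # a fixed latent encoding
cluster_loss = np.array([0.1, 2.0])   # critic's loss estimate per cluster

lr = 0.5
for _ in range(200):
    pi = softmax(W @ z)
    s = int(rng.choice(2, p=pi))      # sample a cluster assignment
    # Score-function gradient: grad E[loss] = E[ loss * grad log pi_s ]
    grad_logits = -pi.copy()
    grad_logits[s] += 1.0             # d(log pi_s)/d(logits) = onehot(s) - pi
    W -= lr * cluster_loss[s] * np.outer(grad_logits, z)

pi = softmax(W @ z)                   # the actor learns to prefer cluster 0
```

Even though the sampled assignment blocks ordinary back-propagation, weighting the log-probability gradient by the critic's loss estimate shifts probability mass toward the low-loss cluster.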
3.2.2 Optimizing the Cluster Centroids
Now, once the parameters of the three networks are fixed (thus, we omit the dependency on $\theta$, $\psi$, and $\phi$), we update the embeddings in $\mathcal{E}$ by minimizing a combination of the predictive clustering loss and the embedding separation loss, $\mathcal{L}_e(\mathcal{E}) = \mathcal{L}_1(\mathcal{E}) + \beta\, \mathcal{L}_3(\mathcal{E})$, where $\beta \geq 0$ is a coefficient chosen to balance the two losses.
3.2.3 Initializing AC-TPC via Pre-Training
Since we transform the combinatorial optimization problem in (1) into iteratively solving two subproblems, initialization is crucial to achieve better optimization; a similar concern has been addressed in (Yang et al., 2017). Therefore, we initialize our model based on the following procedure. First, we pre-train the encoder and the predictor by minimizing the following loss function based on the predicted label distribution given the latent encodings of input sequences, i.e., $\bar{y}_t = g_\phi(f_\theta(\mathbf{x}_{1:t}))$:
(6)  $\mathcal{L}(\theta, \phi) = \mathbb{E}\Big[ \sum_t \ell_1\big(y_t,\, g_\phi(f_\theta(\mathbf{x}_{1:t}))\big) \Big]$
Minimizing (6) encourages the latent encoding to be enriched with information for accurately predicting the label distribution. Then, we perform K-means (other clustering methods can also be applied) on the learned representations to initialize the embeddings $\mathcal{E}$ and the cluster assignments. Finally, we pre-train the selector by minimizing the cross-entropy loss, treating the initialized cluster assignments as the true clusters.
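The K-means initialization step can be sketched as follows. This is a plain Lloyd's-algorithm sketch over pre-trained encodings (a hypothetical helper of our own, not the authors' code; in practice a library implementation would typically be used):

```python
import numpy as np

def kmeans_init(encodings, K, n_iter=50, seed=0):
    """Plain K-means (Lloyd's algorithm) on latent encodings, returning
    initial centroids (the embedding dictionary) and cluster assignments."""
    rng = np.random.default_rng(seed)
    Z = np.asarray(encodings, float)
    # Initialize centroids as K distinct data points.
    centroids = Z[rng.choice(len(Z), size=K, replace=False)]
    assign = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        # Assignment step: nearest centroid per encoding.
        d = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=-1)
        assign = d.argmin(axis=1)
        # Update step: recompute each non-empty centroid.
        for k in range(K):
            if np.any(assign == k):
                centroids[k] = Z[assign == k].mean(axis=0)
    return centroids, assign
```

The resulting centroids seed the embedding dictionary, and the assignments serve as pseudo-labels for pre-training the selector with cross-entropy.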
4 Related Work
Temporal clustering, also known as time-series clustering, is a process of unsupervised partitioning of time-series data into clusters in such a way that homogeneous time-series are grouped together based on a certain similarity measure. Temporal clustering is challenging because i) the data is often high-dimensional – it consists of sequences not only with high-dimensional features but also with many time points – and ii) defining a proper similarity measure for time-series is not straightforward since it is often highly sensitive to distortions (Ratanamahatana et al., 2005). To address these challenges, there have been various attempts to find a good representation with reduced dimensionality or to define a proper similarity measure for time-series (Aghabozorgi et al., 2015).
Recently, Baytas et al. (2017) and Madiraju et al. (2018) proposed temporal clustering methods that utilize low-dimensional representations learned by RNNs. These works are motivated by the success of applying deep neural networks to find “clustering friendly” latent representations for clustering static data (Xie et al., 2017; Yang et al., 2017). In particular, Baytas et al. (2017) utilized a modified LSTM autoencoder to find latent representations that effectively summarize the input time-series and conducted K-means on top of the learned representations as an ad-hoc process. Similarly, Madiraju et al. (2018) proposed a bidirectional-LSTM autoencoder that jointly optimizes the reconstruction loss for dimensionality reduction and the clustering objective. However, these methods do not associate a target property with clusters and, thus, provide little prognostic value about the underlying disease progression.

Our work is most closely related to SOM-VAE (Fortuin et al., 2019). This method jointly optimizes a static variational autoencoder (VAE), which finds latent representations of input features, and a self-organizing map (SOM), which maps the latent representations into more interpretable discrete representations, i.e., the embeddings. However, there are three key differences between our work and SOM-VAE. First, SOM-VAE aims at minimizing the reconstruction loss, specified as the mean squared error between the original input and the input reconstructed from the corresponding embedding. Thus, similar to the aforementioned methods, SOM-VAE does not associate future outcomes of interest with clusters. In contrast, we focus on minimizing the KL divergence between the outcome distribution given the original input sequence and that given the corresponding embedding to build an association between future outcomes of interest and clusters. Second, to overcome the non-differentiability caused by the sampling process (that is, mapping the latent representation to the embeddings), Fortuin et al. (2019) apply the gradient copying technique proposed by van den Oord et al. (2017), while we utilize the training of actor-critic models (Konda & Tsitsiklis, 2000). Finally, while we flexibly model time-series using an LSTM, SOM-VAE handles time-series by integrating a Markov model in the latent representations. This can be a strict assumption, especially in clinical settings where a patient’s medical history is informative for predicting future clinical outcomes (Ranganath et al., 2016).

5 Experiments
In this section, we provide a set of experiments using two real-world time-series datasets. We iteratively update the three networks – the encoder, selector, and predictor – and the embedding dictionary as described in Section 3.2. For the network architecture, we constructed the encoder using a single-layer LSTM (Hochreiter & Schmidhuber, 1997) with 50 nodes and constructed the selector and predictor using two-layer fully-connected networks with 50 nodes in each layer, respectively. The parameters are initialized by Xavier initialization (Glorot & Bengio, 2010) and optimized via the Adam optimizer (Kingma & Ba, 2014) with a fixed learning rate and keep probability. We chose the balancing coefficients via a grid search that achieves the minimum validation loss in (2); the effect of the different loss functions is further investigated in the experiments. Here, all the results are reported using 5 random 64/16/20 train/validation/test splits.
5.1 RealWorld Datasets
We conducted experiments to investigate the performance of AC-TPC on two real-world medical datasets; detailed statistics of each dataset can be found in the Supplementary Material.
UK Cystic Fibrosis registry (UKCF; https://www.cysticfibrosis.org.uk): This dataset records annual follow-ups for 5,171 adult patients (aged 18 years or older) enrolled in the UK CF registry over the period from 2008 to 2015, with a total of 25,012 hospital visits. Each patient is associated with 89 variables (i.e., 11 static and 78 time-varying features), including information on demographics, genetic mutations, bacterial infections, lung function scores, therapeutic managements, and diagnoses of comorbidities. We set the development of different comorbidities in the next year as the label of interest at each time stamp.
Alzheimer’s Disease Neuroimaging Initiative (ADNI; https://adni.loni.usc.edu): This dataset consists of 1,346 patients in the Alzheimer’s disease study, with a total of 11,651 hospital visits, and tracks the disease progression via follow-up observations at 6-month intervals. Each patient is associated with 21 variables (i.e., 5 static and 16 time-varying features), including information on demographics, biomarkers on brain functions, and cognitive test results. We set predictions on the three diagnostic groups – normal brain functioning, mild cognitive impairment, and Alzheimer’s disease – as the label of interest at each time stamp.
5.2 Benchmarks
We compare AC-TPC with clustering methods ranging from conventional approaches based on K-means to state-of-the-art approaches based on deep neural networks. All the benchmarks compared in the experiments are tailored to incorporate time-series data as described below:
Dynamic time warping followed by K-means: Dynamic time warping (DTW) is utilized to quantify the pairwise distance between two variable-length sequences and, then, K-means is applied (KM-DTW).
K-means with deep neural networks: To handle variable-length time-series data, we utilize our encoder and predictor trained based on (6) for fixed-length dimensionality reduction. Then, we apply K-means on the latent encodings (KM-E2P (z)) and on the predicted label distributions (KM-E2P (ȳ)), respectively.
Extensions of DCN (Yang et al., 2017): Since the DCN is designed for static data, we replace its static autoencoder with a sequence-to-sequence network to incorporate time-series data (DCN-S2S). (This extension is representative of recent deep learning approaches for clustering both static data (Xie et al., 2017; Yang et al., 2017) and time-series data (Baytas et al., 2017; Madiraju et al., 2018), since these methods are built upon the same concept – that is, applying deep networks for dimensionality reduction before conducting conventional clustering methods, e.g., K-means.) To associate clusters with the label distribution, we also compare a DCN whose static autoencoder is replaced with our encoder and predictor (DCN-E2P), which focuses on dimensionality reduction while preserving information for label prediction.
SOM-VAE (Fortuin et al., 2019): We compare with SOM-VAE – though this method aims at visualizing the input – since it naturally clusters time-series data (SOM-VAE). In addition, we compare with a variation of SOM-VAE in which the decoder is replaced with our predictor to find embeddings that capture information for predicting the label (SOM-VAE-P). For both cases, we set the dimension of the SOM to the number of clusters $K$.
It is worth highlighting that the label information is provided for training DCN-E2P, KM-E2P, and SOM-VAE-P, while it is not provided for training KM-DTW, DCN-S2S, and SOM-VAE. Please refer to the Supplementary Material for a summary of the major components of the tested benchmarks and the implementation details.
5.3 Performance Metrics
Clustering Performance: We applied the following three standard metrics for evaluating clustering performance when the ground-truth cluster label is available: the purity score, normalized mutual information (NMI) (Vinh et al., 2010), and adjusted Rand index (ARI) (Hubert & Arabie, 1985). More specifically, the purity score assesses how homogeneous each cluster is (ranging from 0 to 1, where 1 means each cluster consists of a single class), the NMI is an information-theoretic measure of how much information is shared between the clusters and the labels, adjusted for the number of clusters (ranging from 0 to 1, where 1 is a perfect clustering), and the ARI is a corrected-for-chance version of the Rand index, which measures the proportion of correct cluster assignments (ranging from −1 to 1, where 1 is a perfect clustering and 0 is a random clustering).
When the ground-truth label is not available, we utilize the average Silhouette index (SI) (Rousseeuw, 1987), which measures how similar a member is to its own cluster (homogeneity within a cluster) compared to other clusters (heterogeneity across clusters). Formally, the SI for a subsequence $i$ can be given as $SI(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$, where $a(i)$ is the average distance from sample $i$ to the other members of its own cluster and $b(i)$ is the smallest average distance from sample $i$ to the members of any other cluster. Here, we used the L1 distance between the ground-truth labels of the future outcomes of interest, since our goal is to group input subsequences with similar future outcomes.
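The outcome-based Silhouette computation described above can be sketched directly (a hypothetical helper written for illustration; library implementations of the Silhouette score with a precomputed distance matrix would serve equally well):

```python
import numpy as np

def silhouette_l1(outcome_labels, assign):
    """Average Silhouette index where the pairwise distance is the L1
    distance between ground-truth future-outcome label vectors."""
    Y = np.asarray(outcome_labels, float)
    assign = np.asarray(assign)
    n = len(Y)
    # Pairwise L1 distances between outcome label vectors.
    D = np.abs(Y[:, None, :] - Y[None, :, :]).sum(-1)
    scores = []
    for i in range(n):
        same = (assign == assign[i])
        same[i] = False
        if not same.any():          # singleton cluster: skip (no within-distance)
            continue
        a = D[i, same].mean()       # mean distance within own cluster
        b = min(D[i, assign == k].mean()
                for k in set(assign.tolist()) - {int(assign[i])})
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return float(np.mean(scores))
```

A value near 1 indicates that members share outcomes within their cluster while differing from other clusters; values near 0 indicate overlapping, poorly separated clusters.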
Prediction Performance: To assess the prediction performance of the identified predictive clusters, we utilized both the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), based on the label predictions of each cluster and the ground-truth binary labels on the future outcomes of interest. Note that the prediction performance is available only for the benchmarks that incorporate the label information during training.

5.4 Clustering Performance
We start with a simple scenario where the true class (i.e., the ground-truth cluster label) is available and the number of classes is tractable. In particular, we set the classes based on the binary labels for the development of three common comorbidities of cystic fibrosis – diabetes, ABPA, and intestinal obstruction – in the next year for the UKCF dataset, and based on the three mutually exclusive diagnostic groups for the ADNI dataset. We compare AC-TPC against the aforementioned benchmarks with respect to the clustering and prediction performance in Table 1.
Dataset  Method  Purity  NMI  ARI  AUROC  AUPRC
UKCF  KM-DTW  0.573±0.01  0.010±0.01  0.014±0.01  N/A  N/A
  KM-E2P (z)  0.719±0.01  0.211±0.01  0.107±0.01  0.726±0.01  0.425±0.02
  KM-E2P (ȳ)  0.751±0.01  0.325±0.01  0.440±0.02  0.807±0.00  0.514±0.01
  DCN-S2S  0.607±0.06  0.059±0.08  0.063±0.09  N/A  N/A
  DCN-E2P  0.751±0.02  0.275±0.02  0.184±0.01  0.772±0.03  0.487±0.03
  SOM-VAE  0.573±0.01  0.006±0.00  0.006±0.01  N/A  N/A
  SOM-VAE-P  0.638±0.04  0.201±0.05  0.283±0.17  0.754±0.05  0.331±0.07
  Proposed  0.807±0.01  0.463±0.01  0.602±0.01  0.843±0.01  0.605±0.01
ADNI  KM-DTW  0.566±0.02  0.019±0.02  0.006±0.02  N/A  N/A
  KM-E2P (z)  0.736±0.03  0.249±0.02  0.230±0.03  0.707±0.01  0.509±0.01
  KM-E2P (ȳ)  0.776±0.05  0.264±0.07  0.317±0.11  0.756±0.04  0.503±0.04
  DCN-S2S  0.567±0.02  0.005±0.00  0.000±0.01  N/A  N/A
  DCN-E2P  0.749±0.06  0.261±0.05  0.215±0.06  0.721±0.03  0.509±0.03
  SOM-VAE  0.566±0.02  0.040±0.06  0.011±0.02  N/A  N/A
  SOM-VAE-P  0.586±0.06  0.085±0.08  0.038±0.06  0.597±0.10  0.376±0.05
  Proposed  0.786±0.03  0.285±0.04  0.330±0.06  0.768±0.02  0.515±0.02
* indicates p < 0.05; ** indicates p < 0.01.
As shown in Table 1, AC-TPC achieved a performance gain over all the tested benchmarks in terms of both clustering and prediction performance – with most of the improvements being statistically significant – on both datasets. Importantly, the clustering methods that do not account for the future outcomes of interest – i.e., KM-DTW, DCN-S2S, and SOM-VAE – identified clusters that provide little prognostic value (note that the true class is derived from the future outcome of interest). This is clearly shown by their ARI values near 0, which indicate that the identified clusters are no different from random assignments. Therefore, sequences that are similar with respect to latent representations tailored for reconstruction, or with respect to the shape-based measurement using DTW, can nevertheless have very different outcomes.
Figure 3: The purity score, NMI, and ARI (mean and 95% confidence interval) for the UKCF dataset with varying numbers of clusters.

In Figure 3, we further investigate the purity score, NMI, and ARI by varying the number of clusters $K$ on the UKCF dataset in the same setting as stated above. Here, the three methods KM-DTW, DCN-S2S, and SOM-VAE are excluded for better visualization. As we can see in Figure 3, our model rarely incurs a performance loss in either NMI or ARI, while the benchmarks (except for SOM-VAE-P) showed a significant decrease in performance as $K$ increased. This is because the number of clusters identified by AC-TPC (i.e., the number of activated clusters, where we define a cluster as activated if at least one subsequence is assigned to it) remained the same most of the time, while the DCN-based methods identified exactly $K$ clusters (due to K-means). Since the NMI and ARI are adjusted for the number of clusters, a smaller number of identified clusters yields, all else being equal, a higher performance. In contrast, while our model achieved the same purity score across these settings, the benchmarks showed improved purity as $K$ increased, since the purity score does not penalize having many clusters. This is an important property of AC-TPC: we do not need to know a priori what the number of clusters is, which is a common practical challenge when applying conventional clustering methods (e.g., K-means).
The performance gain of our model over SOM-VAE-P (and the same analysis applies to SOM-VAE) comes from two possible sources: i) SOM-VAE-P mainly focuses on visualizing the input with the SOM, which makes both the encoder and the embeddings less flexible – this is why it performed better with higher $K$ – and ii) the Markov property can be too strict for time-series data, especially in clinical settings where a patient’s medical history is informative for predicting future clinical outcomes (Ranganath et al., 2016).
5.5 Multiple Future Outcomes – a Practical Scenario
In this experiment, we focus on a more practical scenario where the future outcome of interest is high-dimensional and, thus, the number of classes based on all possible combinations of future outcomes becomes intractable. Suppose that we are interested in the development of multiple comorbidities in the next year, the number of possible combinations of which grows exponentially with the number of comorbidities. Interpreting such a large number of patient subgroups would be a daunting task that hinders the understanding of the underlying disease progression. Since different comorbidities may share common driving factors (Ronan et al., 2017), we hope our model can identify a much smaller number of underlying (latent) clusters that govern the development of comorbidities. Here, to incorporate multiple comorbidities (i.e., binary labels), we redefine the output space accordingly and modify the predictor and the loss functions.
We identified clusters of patients based on the next-year development of different comorbidities in the UKCF dataset and report a subset of the clusters in Figure 5 – Clusters 0, 5, 7, 8, and 10 – together with the frequency of developing important comorbidities in the next year. Here, we selected the clusters with the highest risk of developing diabetes in the next year, and the frequency is calculated in a cluster-specific fashion using the true label. A full list of clusters and comorbidity frequencies can be found in the Supplementary Material. Although all these clusters displayed a high risk of diabetes, the frequency of other co-occurring comorbidities was significantly different across the clusters. In particular, around 89% of the patients in Cluster 5 experienced asthma in the next year, while this was true for less than 3% of the patients in the other clusters. Interestingly, “leukotriene” – a medicine commonly used to manage asthma – and “FEV1% predicted” – a measure of lung function – were the two most different input features between patients in Cluster 5 and those in the other clusters. We observed similar findings in Cluster 7 with ABPA, Cluster 8 with liver disease, and Cluster 10 with osteopenia. Therefore, by grouping patients who are likely to develop a similar set of comorbidities, our method identified clusters that can be translated into actionable information for clinical decision-making.
5.6 TradeOff between Clustering and Prediction
In predictive clustering, there is a common trade-off between the clustering performance (for better interpretability) – which quantifies how homogeneous the data samples are within each cluster and how heterogeneous they are across clusters with respect to the future outcomes of interest – and the prediction performance. The most important parameter governing this trade-off is the number of clusters. More specifically, increasing the number of clusters gives the predictive clusters higher diversity with which to represent the output distribution and, thus, increases the prediction performance while decreasing the clustering performance. In the extreme, with as many clusters as data samples, the identified clusters become fully individualized; as a consequence, each cluster loses interpretability as it no longer groups similar data samples.
To highlight this trade-off, we conducted experiments under the same experimental setup as in Section 5.5. For the performance measures, we utilized the AUROC and AUPRC to assess the prediction performance, and the average SI to assess the clustering performance. To control the number of activated clusters, we set and (since the embedding separation loss in (4) controls the activation of clusters) and reported the performance while increasing the number of possible clusters , i.e., the dimension of the embedding dictionary.
As can be seen in Figure 4, the prediction performance increased with an increasing number of identified clusters, owing to the greater diversity available to represent the label distribution, while the identified clusters became less interpretable. That is, the cohesion and separation among clusters became ambiguous, as shown by the low average SI. On the other hand, when we set (which is selected based on the validation loss in 2), our method consistently identified a similar number of clusters for , i.e., 13.8 on average, in a data-driven fashion and provided slightly reduced prediction performance with significantly better interpretability, i.e., the average SI on average.
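For reference, the average SI rewards partitions whose clusters are cohesive and well separated. The sketch below (toy one-dimensional data, not the UKCF experiment) computes the Silhouette Index directly from its definition and shows how over-partitioning well-separated groups lowers it, mirroring the trade-off discussed above:

```python
import numpy as np

def silhouette_index(X, labels):
    """Average Silhouette Index (SI) of a partition.

    For each sample, a = mean distance to the other members of its own
    cluster, b = smallest mean distance to any other cluster; the
    sample's silhouette is (b - a) / max(a, b), and 0 for singletons.
    """
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    uniq = np.unique(labels)
    s = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        if own.sum() <= 1:
            continue  # singleton cluster: silhouette defined as 0
        a = D[i, own].sum() / (own.sum() - 1)  # excludes the zero self-distance
        b = min(D[i, labels == c].mean() for c in uniq if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Two well-separated groups of points on a line.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
coarse = np.array([0, 0, 0, 1, 1, 1])  # 2 clusters: matches the structure
fine = np.array([0, 0, 1, 2, 2, 3])    # 4 clusters: over-partitioned

print(silhouette_index(X, coarse))  # close to 1: cohesive and well separated
print(silhouette_index(X, fine))    # much lower: cohesion/separation blurred
```

Splitting a natural group into several clusters shrinks the between-cluster margin relative to the within-cluster spread, which is exactly the degradation the average SI captures as the number of clusters grows.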
6 Conclusion
In this paper, we introduced AC-TPC, a deep learning approach for predictive clustering of time-series data. We defined novel loss functions to encourage each cluster to have homogeneous future outcomes (e.g., adverse events, the onset of comorbidities, etc.) and designed optimization procedures to avoid trivial solutions in identifying cluster assignments and centroids. Through experiments on two real-world datasets, we showed that our model achieves superior clustering performance over state-of-the-art methods and identifies meaningful clusters that can be translated into actionable information for clinical decision-making.
Acknowledgements
The authors would like to thank the reviewers for their helpful comments. This work was supported by the National Science Foundation (NSF grants 1524417 and 1722516), the US Office of Naval Research (ONR), and the UK Cystic Fibrosis Trust. We thank Dr. Janet Allen (Director of Strategic Innovation, UK Cystic Fibrosis Trust) for the vision and encouragement. We thank Rebecca Cosgriff and Elaine Gunn for the help with data access, extraction and analysis. We also thank Prof. Andres Floto and Dr. Tomas Daniels, our collaborators, for the very helpful clinical discussions.
References
 Aghabozorgi et al. (2015) Aghabozorgi, S., Shirkhorshidi, A. S., and Wah, T. Y. Time-series clustering – a decade review. Information Systems, 53:16–38, May 2015.
 Baytas et al. (2017) Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. Patient subtyping via time-aware LSTM networks. In Proceedings of the 23rd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2017), 2017.
 Blockeel et al. (2017) Blockeel, H., Dzeroski, S., Struyf, J., and Zenko, B. Predictive Clustering. Springer New York, 2017.
 Boudier et al. (2019) Boudier, A., Chanoine, S., Accordini, S., Anto, J. M., Basagaña, X., Bousquet, J., Demoly, P., Garcia-Aymerich, J., Gormand, F., Heinrich, J., Janson, C., Künzli, N., Matran, R., Pison, C., Raherison, C., Sunyer, J., Varraso, R., Jarvis, D., Leynaert, B., Pin, I., and Siroux, V. Data-driven adult asthma phenotypes based on clinical characteristics are associated with asthma outcomes twenty years later. Allergy, 74(5):953–963, May 2019.
 Fortuin et al. (2019) Fortuin, V., Hüser, M., Locatello, F., Strathmann, H., and Rätsch, G. SOM-VAE: Interpretable discrete representation learning on time series. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.
 Giannoula et al. (2018) Giannoula, A., Gutierrez-Sacristán, A., Bravo, A., Sanz, F., and Furlong, L. I. Identifying temporal patterns in patient disease trajectories using dynamic time warping: A population-based study. Scientific Reports, 8(4216):1–14, March 2018.
 Glorot & Bengio (2010) Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Hubert & Arabie (1985) Hubert, L. and Arabie, P. Comparing partitions. Journal of Classification, 2(1):193–218, December 1985.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Konda & Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Proceedings of the 13th Conference on Neural Information Processing Systems (NIPS 2000), 2000.
 Lee et al. (2019) Lee, C., Yoon, J., and van der Schaar, M. Dynamic-DeepHit: A deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Transactions on Biomedical Engineering, April 2019.
 Luong & Chandola (2017) Luong, D. T. A. and Chandola, V. A k-means approach to clustering disease progressions. In Proceedings of the 5th IEEE International Conference on Healthcare Informatics (ICHI), 2017.
 Madiraju et al. (2018) Madiraju, N. S., Sadat, S. M., Fisher, D., and Karimabadi, H. Deep temporal clustering: Fully unsupervised learning of time-domain features. arXiv preprint arXiv:1802.01059, 2018.
 Ramos et al. (2017) Ramos, K. J., Quon, B. S., Heltshe, S. L., Mayer-Hamblett, N., Lease, E. D., Aitken, M. L., Weiss, N. S., and Goss, C. H. Heterogeneity in survival in adult patients with cystic fibrosis with FEV1 <30% of predicted in the United States. Chest, 151(6):1320–1328, June 2017.
 Ranganath et al. (2016) Ranganath, R., Perotte, A., Elhadad, N., and Blei, D. Deep survival analysis. In Proceedings of the 1st Machine Learning for Healthcare Conference (MLHC 2016), 2016.
 Ratanamahatana et al. (2005) Ratanamahatana, C. A., Keogh, E., Bagnall, A. J., and Lonardi, S. A novel bit level time series representation with implications for similarity search and clustering. In Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005), 2005.
 Ronan et al. (2017) Ronan, N. J., Elborn, J., and Plant, B. J. Current and emerging comorbidities in cystic fibrosis. Presse Med., 46(6):125–138, June 2017.
 Rousseeuw (1987) Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, 1987.
 Rusanov et al. (2016) Rusanov, A., Prado, P. V., and Weng, C. Unsupervised time-series clustering over lab data for automatic identification of uncontrolled diabetes. In Proceedings of the 4th IEEE International Conference on Healthcare Informatics (ICHI), 2016.
 Samal et al. (2011) Samal, L., Wright, A., Wong, B., Linder, J., and Bates, D. Leveraging electronic health records to support chronic disease management: the need for temporal data views. Informatics in Primary Care, 19(2):65–74, 2011.
 van den Oord et al. (2017) van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017.
 Vinh et al. (2010) Vinh, N. X., Epps, J., and Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11(1):2837–2854, October 2010.
 Wami et al. (2013) Wami, W. M., Buntinx, F., Bartholomeeusen, S., Goderis, G., Mathieu, C., and Aerts, M. Influence of chronic comorbidity and medication on the efficacy of treatment in patients with diabetes in general practice. The British Journal of General Practice, 63(609):267–273, March 2013.
 Xie et al. (2016) Xie, J., Girshick, R., and Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), 2016.
 Yang et al. (2017) Yang, B., Fu, X., Sidiropoulos, N. D., and Hong, M. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.
 Yoon et al. (2017) Yoon, J., Davtyan, C., and van der Schaar, M. Discovery and clinical decision support for personalized healthcare. IEEE J Biomed Health Inform., 21(4):1133–1145, 2017.
 Zhang et al. (2019) Zhang, X., Chou, J., Liang, J., Xiao, C., Zhao, Y., Sarva, H., Henchcliffe, C., and Wang, F. Data-driven subtyping of Parkinson’s disease using longitudinal clinical records: A cohort study. Scientific Reports, 9(797):1–12, January 2019.
Appendix A AC-TPC for Regression and Binary Classification Tasks
As the task changes, estimating the label distribution and calculating the KL divergence in (1) of the manuscript must be redefined accordingly: For the regression task, i.e., , we modify the predictor as and replace by . Minimizing is equivalent to minimizing the KL divergence between and when we assume these probability densities follow Gaussian distributions with the same variance.
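Concretely, this equivalence follows from the standard closed form of the KL divergence between two Gaussians with a shared variance \(\sigma^2\) (a textbook identity, stated here in generic notation):

```latex
\mathrm{KL}\!\left(\mathcal{N}(y,\sigma^2)\,\big\|\,\mathcal{N}(\hat{y},\sigma^2)\right)
  = \log\frac{\sigma}{\sigma}
  + \frac{\sigma^2 + (y-\hat{y})^2}{2\sigma^2}
  - \frac{1}{2}
  = \frac{(y-\hat{y})^2}{2\sigma^2}\,.
```

Hence, minimizing the squared error between the true and predicted outcomes minimizes the KL divergence up to the constant factor \(1/(2\sigma^2)\).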
For the dimensional binary classification task, i.e., , we modify the predictor as and replace by , which is required to minimize the KL divergence. Here, and indicate the th element of and , respectively. The basic assumption here is that the distribution of each binary label is independent given the input sequence.
Appendix B Detailed Derivation of (5)
To derive the gradient of the predictive clustering loss in (5) of the manuscript with respect to , we utilized the ideas from actor-critic models (Konda & Tsitsiklis, 2000) on :
(S.1) 
where the second equality comes from the following derivation of the former term:
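The score-function (actor-critic) trick underlying (S.1) can be illustrated numerically. The toy sketch below uses illustrative stand-ins (a categorical "selector" over clusters and a per-cluster loss `f`, not the paper's actual loss) and checks a sampled log-derivative gradient estimate against the exact gradient of the expectation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# Toy "selector": categorical distribution over K clusters, and a
# per-cluster loss f(k) standing in for the predictive-clustering loss.
logits = np.array([0.2, -0.5, 1.0])
f = np.array([1.0, 3.0, 2.0])
pi = softmax(logits)

# Exact gradient of J = E_{k~pi}[f(k)] with respect to the logits.
J = pi @ f
exact = pi * (f - J)

# Score-function (REINFORCE-style) estimate: average f(k) * grad log pi_k
# over sampled cluster assignments k, where grad log pi_k = onehot(k) - pi.
n = 500_000
ks = rng.choice(len(pi), size=n, p=pi)
onehot = np.eye(len(pi))[ks]
est = ((onehot - pi) * f[ks, None]).mean(axis=0)

print(exact)
print(est)  # approaches the exact gradient as the sample count grows
```

This is the same mechanism that lets the sampled cluster assignments in AC-TPC propagate a gradient signal back to the selector despite the non-differentiable sampling step.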
Appendix C Pseudo-Code of AC-TPC
As illustrated in Section 3.2, AC-TPC is trained in an iterative fashion. We provide the pseudo-code for optimizing our model in Algorithm 1 and that for initializing the parameters in Algorithm 2.
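The alternating structure of the optimization can be sketched with a deliberately simplified loop in outcome space (illustrative only: AC-TPC updates the encoder, selector, and predictor with gradient steps rather than the closed-form assignment/mean steps used here, and all names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for the alternating scheme: samples carry an "outcome"
# vector, centroids live in that outcome space, and we alternate
# (i) hard cluster assignment and (ii) centroid refinement until the
# centroids stop moving.
y = np.vstack([rng.normal(0.2, 0.02, size=(20, 2)),   # outcome profile A
               rng.normal(0.8, 0.02, size=(20, 2))])  # outcome profile B
centroids = y[[0, 20]].copy()  # deterministic init: one sample per profile

for step in range(50):
    # (i) assignment step: each sample selects its closest centroid
    d = np.linalg.norm(y[:, None, :] - centroids[None, :, :], axis=-1)
    assign = d.argmin(axis=1)
    # (ii) centroid step: move each centroid to the mean of its members
    new = np.array([y[assign == k].mean(axis=0) if (assign == k).any()
                    else centroids[k] for k in range(len(centroids))])
    if np.allclose(new, centroids):
        break  # converged: assignments and centroids are consistent
    centroids = new

print(assign)     # the two outcome profiles end up in different clusters
print(centroids)  # centroids sit near the two profile means
```

The fixed-point character of this loop (assignments consistent with centroids) is what the interleaved updates in Algorithm 1 aim for, with the separate initialization in Algorithm 2 playing the role of the warm start.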
STATIC COVARIATES  Type  Mean  Type  Mean  
Demographic  Gender  Bin.  0.55  
Genetic  Class I Mutation  Bin.  0.05  Class VI Mutation  Bin.  0.86  
Class II Mutation  Bin.  0.87  DF508 Mutation  Bin.  0.87  
Class III Mutation  Bin.  0.89  G551D Mutation  Bin.  0.06  
Class IV Mutation  Bin.  0.05  Homozygous  Bin.  0.58  
Class V Mutation  Bin.  0.04  Heterozygous  Bin.  0.42  
TIMEVARYING COVARIATES  Type  Mean  Min / Max  Type  Mean  Min / Max  
Demographic  Age  Cont.  30.4  18.0 / 86.0  Height  Cont.  168.0  129.0 / 198.6 
Weight  Cont.  64.1  24.0 / 173.3  BMI  Cont.  22.6  10.9 / 30.0  
Smoking Status  Bin.  0.1  
Lung Func. Scores  FEV  Cont.  2.3  0.2 / 6.3  Best FEV  Cont.  2.5  0.3 / 8.0 
FEV% Pred.  Cont.  65.1  9.0 / 197.6  Best FEV% Pred.  Cont.  71.2  7.5 / 164.3  
Hospitalization  IV ABX Days Hosp.  Cont.  12.3  0 / 431  Non-IV Hosp. Adm.  Cont.  1.2  0 / 203 
IV ABX Days Home  Cont.  11.9  0 / 441  
Lung Infections  B. Cepacia  Bin.  0.05  P. Aeruginosa  Bin.  0.59  
H. Influenza  Bin.  0.05  K. Pneumoniae  Bin.  0.00  
E. Coli  Bin.  0.01  ALCA  Bin.  0.03  
Aspergillus  Bin.  0.14  NTM  Bin.  0.03  
GramNegative  Bin.  0.01  Xanthomonas  Bin.  0.05  
S. Aureus  Bin.  0.30  
Comorbidities  Liver Disease  Bin.  0.16  Depression  Bin.  0.07  
Asthma  Bin.  0.15  Hemoptysis  Bin.  0.01  
ABPA  Bin.  0.12  Pancreatitis  Bin.  0.01  
Hypertension  Bin.  0.04  Hearing Loss  Bin.  0.03  
Diabetes  Bin.  0.28  Gall bladder  Bin.  0.01  
Arthropathy  Bin.  0.09  Colonic stricture  Bin.  0.00  
Bone fracture  Bin.  0.01  Intest. Obstruction  Bin.  0.08  
Osteoporosis  Bin.  0.09  GI bleed – no var.  Bin.  0.00  
Osteopenia  Bin.  0.21  GI bleed – var.  Bin.  0.00  
Cancer  Bin.  0.00  Liver Enzymes  Bin.  0.16  
Cirrhosis  Bin.  0.03  Kidney Stones  Bin.  0.02  
Treatments  Dornase Alpha  Bin.  0.56  Inhaled B. BAAC  Bin.  0.03  
Antifungals  Bin.  0.07  Inhaled B. LAAC  Bin.  0.08  
HyperSaline  Bin.  0.23  Inhaled B. SAAC  Bin.  0.05  
HypertonicSaline  Bin.  0.01  Inhaled B. LABA  Bin.  0.11  
Tobi Solution  Bin.  0.20  Inhaled B. Dilators  Bin.  0.57  
Cortico Combo  Bin.  0.41  Cortico Inhaled  Bin.  0.15  
Non-IV Ventilation  Bin.  0.05  Oral B. Theoph.  Bin.  0.04  
Acetylcysteine  Bin.  0.02  Oral B. BA  Bin.  0.03  
Aminoglycoside  Bin.  0.03  Oral Hypo. Agents  Bin.  0.01  
Ibuprofen  Bin.  0.00  Chronic Oral ABX  Bin.  0.53  
Drug Dornase  Bin.  0.02  Cortico Oral  Bin.  0.14  
HD Ibuprofen  Bin.  0.00  Oxygen Therapy  Bin.  0.11  
Tobramycin  Bin.  0.03  O2 Exacerbation  Bin.  0.03  
Leukotriene  Bin.  0.07  O2 Nocturnal  Bin.  0.03  
Colistin  Bin.  0.03  O2 Continuous  Bin.  0.03  
Diabetes Insulin  Bin.  0.01  O2 Pro re nata  Bin.  0.01  
Macrolide ABX  Bin.  0.02  
ABX: antibiotics 
STATIC COVARIATES  Type  Mean  Min/Max (Mode)  Type  Mean  Min/Max (Mode)  
Demographic  Race  Cat.  0.93  White  Ethnicity  Cat.  0.97  No Hisp/Latino 
Education  Cat.  0.23  C16  Marital Status  Cat.  0.75  Married  
Genetic  APOE  Cont.  0.44  0/2  
TIMEVARYING COVARIATES  Type  Mean  Min / Max  Type  Mean  Min / Max  
Demographic  Age  Cont.  73.6  55/92  
Biomarker  Entorhinal  Cont.  3.6E+3  1.0E+3 / 6.7E+3  Mid Temp  Cont.  2.0E+4  8.9E+3 / 3.2E+4 
Fusiform  Cont.  1.8E+5  9.0E+4 / 2.9E+5  Ventricles  Cont.  4.1E+4  5.7E+3 / 1.6E+5  
Hippocampus  Cont.  6.9E+3  2.8E+3 / 1.1E+4  Whole Brain  Cont.  1.0E+6  6.5E+5 / 1.5E+6  
Intracranial  Cont.  1.5E+6  2.9E+2 / 2.1E+6  
Cognitive  ADAS11  Cont.  8.58  0/70  ADAS13  Cont.  13.61  0/85 
CDR Sum of Boxes  Cont.  1.21  0/17  Mini Mental State  Cont.  27.84  2/30  
RAVLT Forgetting  Cont.  4.19  12/15  RAVLT Immediate  Cont.  38.25  0/75  
RAVLT Learning  Cont.  4.65  5/14  RAVLT Percent  Cont.  51.70  500/100 
Appendix D Details of the Datasets
D.1 UKCF Dataset
The UK Cystic Fibrosis registry (UKCF) (https://www.cysticfibrosis.org.uk/the-work-we-do/uk-cf-registry) records annual follow-ups for 5,171 adult patients (aged 18 years or older) over the period from 2008 to 2015, with a total of 25,012 hospital visits. Each patient is associated with 89 variables (i.e., 11 static and 78 time-varying features), including information on demographics and genetic mutations, bacterial infections, lung function scores, therapeutic management, and diagnoses of comorbidities. The detailed statistics are given in Table S.1.
D.2 ADNI Dataset
The Alzheimer’s Disease Neuroimaging Initiative (ADNI) (https://adni.loni.usc.edu) study consists of 1,346 patients with a total of 11,651 hospital visits and tracks the disease progression via follow-up observations at 6-month intervals. Each patient is associated with 21 variables (i.e., 5 static and 16 time-varying features), including information on demographics, biomarkers of brain function, and cognitive test results. The three diagnostic groups were normal brain functioning (), mild cognitive impairment (), and Alzheimer’s disease (). The detailed statistics are given in Table S.2.
Methods  Handling Time-Series  Clustering Method  Similarity Measure  Label Provided  Label Associated 
KM-DTW  DTW  k-means  DTW  N  N 
KM-E2P ()  RNN  k-means  Euclidean in  Y  Y (indirect) 
KM-E2P ()  RNN  k-means  Euclidean in  Y  Y (direct) 
DCN-S2S  RNN  k-means  Euclidean in  N  N 
DCN-E2P  RNN  k-means  Euclidean in  Y  Y (indirect) 
SOM-VAE  Markov model  embedding mapping  reconstruction loss  N  N 
SOM-VAE-P  Markov model  embedding mapping  prediction loss  Y  Y (direct) 
Proposed  RNN  embedding mapping  KL divergence  Y  Y (direct) 
Appendix E Details of the Benchmarks
We compared AC-TPC in the experiments against clustering methods ranging from conventional approaches based on k-means to state-of-the-art approaches based on deep neural networks. The details of how we implemented the benchmarks are as follows:


Dynamic time warping followed by k-means (https://github.com/rtavenar/tslearn): Dynamic time warping (DTW) is utilized to quantify the pairwise distance between two variable-length sequences; then, k-means is applied (denoted as KM-DTW).

k-means with deep neural networks: To handle variable-length time-series data, we utilized an encoder-predictor network as depicted in Figure S.1(b) and trained the network based on (6) for dimensionality reduction; this provides fixed-length and low-dimensional representations of the time-series. Then, we applied k-means on the latent encodings (denoted as KM-E2P ()) and on the predicted label distributions (denoted as KM-E2P ()), respectively. We implemented the encoder and predictor of KM-E2P with the same network architectures as those of our model: the encoder is a single-layer LSTM with 50 nodes and the predictor is a two-layered fully-connected network with 50 nodes in each layer.

Extensions of DCN (https://github.com/boyangumn/DCN) (Yang et al., 2017): Since DCN is designed for static data, we utilized a sequence-to-sequence model in Figure S.1(a) for the encoder-decoder network as an extension to incorporate time-series data (denoted as DCN-S2S) and trained the network based on the reconstruction loss (using the reconstructed input sequence ). For implementing DCN-S2S, we used a single-layer LSTM with 50 nodes for both the encoder and the decoder, and we augmented a fully-connected layer with 50 nodes to reconstruct the original input sequence from the latent representation of the decoder.
In addition, since predictive clustering is associated with the label distribution, we compared against a DCN whose encoder-decoder structure is replaced with our encoder-predictor network in Figure S.1(b) (denoted as DCN-E2P) to focus the dimensionality reduction – and, thus, the latent encodings on which clustering is performed – on the information for predicting the label distribution. We implemented the encoder and predictor of DCN-E2P with the same network architectures as those of our model, as described in Section 5.

SOM-VAE (https://github.com/ratschlab/SOM-VAE) (Fortuin et al., 2019): We compare with SOM-VAE – though this method is oriented towards visualization of input data via self-organizing maps – since it naturally clusters time-series data under a Markov assumption (denoted as SOM-VAE). We replaced the original CNN architectures of the encoder and the decoder with three-layered fully-connected networks with 50 nodes in each layer. The network architecture is depicted in Figure S.1(c), where and indicate the reconstructed inputs based on the encoding and the embedding at time , respectively.
In addition, we compare with a variation of SOM-VAE obtained by replacing the decoder with the predictor, which encourages the latent encoding to capture information for predicting the label distribution (denoted as SOM-VAE-P). For the implementation, we replaced the decoder of SOM-VAE with our predictor – a two-layered fully-connected network with 50 nodes in each layer – to predict the label distribution as illustrated in Figure S.1(d). Here, and indicate the predicted labels based on the encoding and the embedding at time , respectively.
For both cases, we used the default values for the balancing coefficients of SOM-VAE and set the dimension of the SOM to be equal to .
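For reference, the DTW distance used by the KM-DTW baseline above can be sketched with the textbook dynamic program (a minimal one-dimensional illustration; the experiments used the tslearn implementation linked above):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping distance between two 1-D sequences of
    possibly different lengths, using absolute difference as local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# A time-shifted copy of a sequence is close under DTW even though the
# sequences have different lengths and would misalign point-by-point.
x = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
y = [0.0, 0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
print(dtw(x, y))  # 0.0: perfect alignment despite the shift
```

This tolerance to temporal misalignment is why DTW is a natural similarity measure for variable-length patient trajectories, whereas a plain Euclidean distance would require equal-length, aligned sequences.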
We compared and summarized major components of the benchmarks in Table S.3.
Appendix F Additional Experiments
F.1 Contributions of the Auxiliary Loss Functions
As described in Section 3.1, we introduced two auxiliary loss functions – the sample-wise entropy of cluster assignment (3) and the embedding separation loss (4) – to avoid trivial solutions that may arise in identifying the predictive clusters. To analyze the contribution of each auxiliary loss function, we report the average number of activated clusters, the clustering performance, and the prediction performance on the UKCF dataset with comorbidities as described in Section 5.4. Throughout the experiment, we set – which is larger than – to assess the contribution of these loss functions to the number of activated clusters.
Coefficients  Clustering Performance  Prognostic Value  
Activated No.  Purity  NMI  ARI  AUROC  AUPRC  
0.0  0.0  16  0.573±0.01  0.006±0.00  0.000±0.00  0.500±0.00  0.169±0.00 
0.0  1.0  16  0.573±0.01  0.006±0.00  0.000±0.00  0.500±0.00  0.169±0.00 
3.0  0.0  8.4  0.795±0.01  0.431±0.01  0.569±0.01  0.840±0.01  0.583±0.02 
3.0  1.0  8  0.808±0.01  0.468±0.01  0.606±0.01  0.852±0.00  0.608±0.01 
As we can see in Table S.4, both auxiliary loss functions make important contributions to improving the quality of predictive clustering. More specifically, the sample-wise entropy encourages the selector to choose one dominant cluster. Thus, as the results with show, without the sample-wise entropy our selector assigns an equal probability to all clusters, which results in a trivial solution. We also observed that, by augmenting the embedding separation loss (4), AC-TPC identifies a smaller number of clusters owing to the well-separated embeddings.
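The role of the sample-wise entropy can be seen directly on toy assignment probabilities (a minimal sketch; `sample_wise_entropy` is an illustrative name, not the paper's implementation): the entropy is maximal for the uniform selector that characterizes the trivial solution and zero for a fully confident, one-dominant-cluster assignment:

```python
import numpy as np

def sample_wise_entropy(pi):
    """Entropy (in nats) of one sample's cluster-assignment probabilities."""
    pi = np.asarray(pi, dtype=float)
    nz = pi > 0  # 0 * log 0 is treated as 0
    return float(-(pi[nz] * np.log(pi[nz])).sum())

K = 16
uniform = np.full(K, 1.0 / K)  # trivial solution: no dominant cluster
confident = np.eye(K)[3]       # one dominant cluster

print(sample_wise_entropy(uniform))    # log(16) ~ 2.77, the maximum
print(sample_wise_entropy(confident))  # zero entropy: confident assignment
```

Penalizing this quantity therefore pushes each sample's assignment distribution away from the uniform case and toward a single dominant cluster, which is exactly the behavior observed in Table S.4.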
F.2 Additional Results on Targeting Multiple Future Outcomes
Throughout the experiment in Section 5.5, we identified subgroups of patients that are associated with the next-year development of different comorbidities in the UKCF dataset. In Table S.5, we report the identified clusters and the full list of comorbidities developed in the year following the latest observation, with the corresponding frequencies calculated in a cluster-specific fashion based on the true labels.
As we can see in the table, the identified clusters displayed very different label distributions; that is, the combinations of comorbidities as well as their frequencies were very different across the clusters. For example, patients in Cluster 1 had a low risk of developing any comorbidity in the next year, while 85% of patients in Cluster 0 experienced diabetes in the next year.
Clusters  Comorbidities and Frequencies  
Cluster 0  Diabetes (0.85)  Liver Enzymes (0.21)  Arthropathy (0.14)  Depression (0.10) 
Hypertension (0.08)  Osteopenia (0.07)  Intest. Obstruction (0.07)  Cirrhosis (0.04)  
ABPA (0.04)  Liver Disease (0.04)  Osteoporosis (0.03)  Hearing Loss (0.03)  
Asthma (0.02)  Kidney Stones (0.01)  Bone fracture (0.01)  Hemoptysis (0.01)  
Pancreatitis (0.01)  Cancer (0.00)  Gall bladder (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 1  Liver Enzymes (0.09)  Arthropathy (0.08)  Depression (0.07)  Intest. Obstruction (0.06) 
Diabetes (0.06)  Osteopenia (0.05)  ABPA (0.04)  Asthma (0.03)  
Liver Disease (0.03)  Hearing Loss (0.03)  Osteoporosis (0.02)  Pancreatitis (0.02)  
Kidney Stones (0.02)  Hypertension (0.01)  Cirrhosis (0.01)  Gall bladder (0.01)  
Cancer (0.01)  Hemoptysis (0.00)  Bone fracture (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 2  ABPA (0.77)  Osteopenia (0.21)  Intest. Obstruction (0.11)  Hearing Loss (0.10) 
Liver Enzymes (0.07)  Diabetes (0.06)  Depression (0.05)  Pancreatitis (0.05)  
Liver Disease (0.04)  Arthropathy (0.04)  Asthma (0.03)  Bone fracture (0.02)  
Osteoporosis (0.02)  Hypertension (0.01)  Cancer (0.01)  Cirrhosis (0.01)  
Kidney Stones (0.01)  Gall bladder (0.01)  Hemoptysis (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 3  Asthma (0.89)  Liver Disease (0.87)  Diabetes (0.29)  Osteopenia (0.28) 
Liver Enzymes (0.24)  ABPA (0.15)  Osteoporosis (0.11)  Hearing Loss (0.06)  
Arthropathy (0.05)  Intest. Obstruction (0.05)  Depression (0.04)  Hypertension (0.03)  
Cirrhosis (0.02)  Kidney Stones (0.02)  Pancreatitis (0.02)  Gall bladder (0.02)  
Cancer (0.01)  Bone fracture (0.00)  Hemoptysis (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 4  Osteoporosis (0.76)  Diabetes (0.43)  Arthropathy (0.20)  Liver Enzymes (0.18) 
Osteopenia (0.15)  Depression (0.13)  Intest. Obstruction (0.11)  ABPA (0.11)  
Hearing Loss (0.09)  Liver Disease (0.08)  Hypertension (0.07)  Cirrhosis (0.07)  
Kidney Stones (0.03)  Asthma (0.02)  Hemoptysis (0.02)  Bone fracture (0.02)  
Gall bladder (0.01)  Pancreatitis (0.01)  Cancer (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 5  Asthma (0.88)  Diabetes (0.81)  Osteopenia (0.28)  ABPA (0.24) 
Liver Enzymes (0.22)  Depression (0.15)  Osteoporosis (0.14)  Intest. Obstruction (0.12)  
Hypertension (0.10)  Cirrhosis (0.10)  Liver Disease (0.09)  Arthropathy (0.08)  
Bone fracture (0.01)  Hemoptysis (0.01)  Pancreatitis (0.01)  Hearing Loss (0.01)  
Cancer (0.01)  Kidney Stones (0.01)  GI bleed – var. (0.01)  Gall bladder (0.00)  
Colonic stricture (0.00)  GI bleed – no var. (0.00)  
Cluster 6  Liver Disease (0.85)  Liver Enzymes (0.37)  Osteopenia (0.27)  ABPA (0.09) 
Arthropathy (0.07)  Diabetes (0.06)  Intest. Obstruction (0.06)  Osteoporosis (0.05)  
Depression (0.03)  Asthma (0.03)  Hearing Loss (0.03)  Cirrhosis (0.02)  
Hemoptysis (0.02)  Hypertension (0.01)  Kidney Stones (0.01)  Pancreatitis (0.00)  
Gall bladder (0.00)  Bone fracture (0.00)  Cancer (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 7  ABPA (0.83)  Diabetes (0.78)  Osteopenia (0.25)  Osteoporosis (0.24) 
Liver Enzymes (0.15)  Intest. Obstruction (0.12)  Liver Disease (0.09)  Hypertension (0.07)  
Hearing Loss (0.07)  Arthropathy (0.06)  Depression (0.04)  Cirrhosis (0.02)  
Asthma (0.01)  Bone fracture (0.01)  Kidney Stones (0.01)  Hemoptysis (0.01)  
Cancer (0.00)  Pancreatitis (0.00)  Gall bladder (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 8  Diabetes (0.94)  Liver Disease (0.83)  Liver Enzymes (0.43)  Osteopenia (0.30) 
Hearing Loss (0.11)  Osteoporosis (0.10)  Intest. Obstruction (0.09)  Cirrhosis (0.08)  
Depression (0.08)  ABPA (0.07)  Arthropathy (0.06)  Hypertension (0.05)  
Kidney Stones (0.05)  Asthma (0.02)  Hemoptysis (0.01)  Bone fracture (0.01)  
Cancer (0.00)  Pancreatitis (0.00)  Gall bladder (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 9  Asthma (0.89)  Osteopenia (0.26)  ABPA (0.19)  Arthropathy (0.14) 
Intest. Obstruction (0.11)  Depression (0.08)  Osteoporosis (0.08)  Diabetes (0.06)  
Liver Enzymes (0.06)  Hemoptysis (0.03)  Hypertension (0.02)  Liver Disease (0.02)  
Pancreatitis (0.02)  Bone fracture (0.01)  Cancer (0.01)  Cirrhosis (0.01)  
Gall bladder (0.01)  Hearing Loss (0.01)  Kidney Stones (0.00)  Colonic stricture (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00)  
Cluster 10  Osteopenia (0.82)  Diabetes (0.81)  Arthropathy (0.23)  Depression (0.19) 
Liver Enzymes (0.18)  Hypertension (0.16)  Hearing Loss (0.10)  Liver Disease (0.10)  
Osteoporosis (0.10)  Intest. Obstruction (0.09)  ABPA (0.09)  Kidney Stones (0.07)  
Cirrhosis (0.05)  Asthma (0.01)  Cancer (0.00)  GI bleed – var. (0.00)  
Bone fracture (0.00)  Hemoptysis (0.00)  Pancreatitis (0.00)  Gall bladder (0.00)  
Colonic stricture (0.00)  GI bleed – no var. (0.00)  
Cluster 11  Osteopenia (0.77)  Liver Enzymes (0.18)  Arthropathy (0.12)  Depression (0.09) 
Hypertension (0.06)  Diabetes (0.06)  Hearing Loss (0.06)  ABPA (0.05)  
Liver Disease (0.05)  Osteoporosis (0.04)  Intest. Obstruction (0.04)  Cirrhosis (0.02)  
Asthma (0.02)  Pancreatitis (0.02)  Bone fracture (0.01)  Cancer (0.01)  
Kidney Stones (0.00)  Gall bladder (0.00)  Colonic stricture (0.00)  Hemoptysis (0.00)  
GI bleed – no var. (0.00)  GI bleed – var. (0.00) 
F.3 How Do the Temporal Phenotypes Change over Time?
In this subsection, we demonstrate run-time examples of how AC-TPC flexibly updates the cluster assignments over time with respect to the future development of comorbidities in the next year. Figure S.2 illustrates three representative patients:


Patient A had diabetes from the beginning of the study and developed asthma as an additional comorbidity at . Accordingly, AC-TPC changed the temporal phenotype assigned to this patient from Cluster 0, which consists of patients who are very likely to develop diabetes but very unlikely to develop asthma in the next year, to Cluster 5, which consists of patients who are likely to develop both diabetes and asthma in the next year, at .

Patient B had ABPA from the beginning of the study and developed diabetes at . Similarly, AC-TPC changed the temporal phenotype assigned to this patient from Cluster 2, which consists of patients who are likely to develop ABPA but not diabetes in the next year, to Cluster 7, which consists of patients who are likely to develop both ABPA and diabetes in the next year, at .

Patient C had no comorbidity at the beginning of the study, and developed asthma and liver disease as additional comorbidities, respectively at and . AC-TPC changed the temporal phenotypes assigned to this patient from Cluster 1 to Cluster 9 at and then to Cluster 3 at . The changes in the temporal phenotypes were consistent with the actual development of asthma and liver disease considering the distribution of comorbidity development in the next year – that is, Cluster 1 consists of patients who are not likely to develop any comorbidities in the next year, Cluster 9 consists of patients who are likely to develop asthma but not liver disease, and Cluster 3 consists of patients who are likely to develop asthma and liver disease in the next year.