Deep Semi-Supervised Embedded Clustering (DSEC) for Stratification of Heart Failure Patients

12/24/2020 ∙ by Oliver Carr, et al.

Determining phenotypes of diseases can have considerable benefits for in-hospital patient care and for drug development. The structure of high-dimensional data sets such as electronic health records is often represented through an embedding of the data, with clustering methods used to group data of similar structure. If subgroups are known to exist within data, supervised methods may be used to influence the clusters discovered. We propose to extend deep embedded clustering to a semi-supervised deep embedded clustering algorithm to stratify subgroups through known labels in the data. In this work we apply deep semi-supervised embedded clustering to determine data-driven patient subgroups of heart failure from the electronic health records of 4,487 heart failure and control patients. We find clinically relevant clusters from an embedded space derived from heterogeneous data. The proposed algorithm can potentially find new undiagnosed subgroups of patients that have different outcomes and, therefore, lead to improved treatments.




1 Introduction

Patient populations, such as those with heart failure, are often heterogeneous in presentation and in response to therapy. Heart failure is a complex disease typically classed by clinicians into two categories: heart failure with reduced ejection fraction (HFrEF) and heart failure with preserved ejection fraction (HFpEF) (Inamdar and Inamdar, 2016). Subgroups of heart failure patients can also be defined by additional measures that are known to reflect poor outcome (e.g. serum urea, serum creatinine) or by co-morbidities (e.g. diabetes) (Inamdar and Inamdar, 2016). Better characterisation of subgroups may allow for adjusted treatments in the different patient cohorts.

Embeddings, or low-dimensional representations of data, are used to extract the most relevant features which describe the structure of the data. Low-dimensional representation approaches are often based on natural language processing methods (Zhang et al., 2018; Zhu et al., 2016) or on autoencoders that produce latent space representations (Wei and Eickhoff, 2018). Both Denaxas et al. (2018) and Choi et al. (2016) have investigated embeddings using heart failure patient data.

Electronic health record (EHR) data is heterogeneous, varies longitudinally, is sparse, and is not well suited to the application of traditional clustering analysis approaches without careful pre-processing (Beaulieu-Jones et al., 2018; Donders et al., 2006). Clustering approaches can be applied directly to the raw features or to an embedded representation of the input. Principal component analysis (PCA) (Wold et al., 1987) is a common approach to finding a lower dimensional embedding, but unsupervised or semi-supervised deep architectures such as autoencoders allow more complex representations of the data to be learnt (Shickel et al., 2018; Beaulieu-Jones and Greene, 2016). Continuous measures from EHRs have been used for patient subgrouping (Liu et al., 2019; Choi et al., 2019). These measures have been used in combination with medical codes, medication information, procedural codes, or text (Li et al., 2019; Miotto et al., 2016; Lasko et al., 2013; Beaulieu-Jones and Greene, 2016).

In this study we explore an extension of deep embedding clustering approaches to EHR patient cohorts in both an unsupervised and a semi-supervised fashion. This work builds on previous approaches such as Deep Patient (Miotto et al., 2016) and Beaulieu-Jones and Greene (2016), which use standard autoencoders. Previous work has also focused on semi-supervised clustering, such as Enguehard et al. (2019), who use convolutional neural networks to cluster partially labelled images, and Ren et al. (2019), who use pairwise constraints to cluster data, whereas in this work we focus on transfer learning of a network. The key contributions of this paper are i) the application of DEC to patient records, ii) the extension of DEC as a novel semi-supervised approach that allows the handling of heterogeneous input measures, and iii) a demonstration that clinically relevant subgroups within a heart failure cohort can be determined using data-driven approaches.

2 Methods

2.1 Deep Embedded Clustering (DEC)

DEC (Xie et al., 2016) is a powerful approach that combines an autoencoder with a clustering loss to learn a representation of the data that aims to also produce separable clusters to improve the analysis of the embedded space.

The DEC method transforms the data space $X$ using a non-linear mapping $f_\theta: X \rightarrow Z$, where $X \in \mathbb{R}^{d}$, $d$ is the number of features used, and $Z$ is the embedded space, a lower dimensional space than $X$. $\theta$ is a set of learnable parameters and $f_\theta$ is parametrized as a deep neural network.

Initially, a multi-layer deep autoencoder is implemented as a series of stacked de-noising autoencoders, with a bottleneck layer acting as the embedded space. Rectified linear unit (ReLU) activation functions (Nair and Hinton, 2010) are used at each layer except the first and last layers. The multi-layer deep autoencoder is then trained to optimize a reconstruction loss by minimizing the least squares error between the input and output of the autoencoder. Once the autoencoder has been pre-trained, the decoder is cut off, leaving the encoder to transform the input, $X$, to the embedded space, $Z$.

A clustering step, which uses k-means clustering in the embedded space, $Z$, is then added to the network after the embedded space. For this purpose, the network is fine-tuned using the Kullback-Leibler (KL) divergence as the loss function for the clustering step. An iterative approach is used to assign points in the embedded space, $z_i$, to the cluster centroids, $\mu_j$, and to update the non-linear mapping, $f_\theta$, of $X$ to $Z$. This step is repeated until a convergence criterion is met; full details of the method are given in Xie et al. (2016).
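As a minimal sketch of the clustering step, the code below follows the soft-assignment, target-distribution, and KL-divergence formulation of Xie et al. (2016). The symbol names, the toy data, and the choice of the first two points as centroids are illustrative assumptions, not the configuration used in this paper.

```python
import numpy as np

def soft_assign(z, mu, alpha=1.0):
    """Student's t soft assignment q_ij of embedded points z_i to centroids mu_j."""
    # Squared Euclidean distance between every point and every centroid, shape (n, k).
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p_ij: square q and normalise by the soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q):
    """KL(P || Q) summed over all points and clusters."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 3))   # six hypothetical points in a 3-D embedded space
mu = z[:2].copy()             # two toy centroids (DEC initialises these with k-means)
q = soft_assign(z, mu)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

In a full implementation the loss would be minimised with respect to both the centroids and the encoder parameters, with the target distribution recomputed periodically.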

2.2 Deep Semi-Supervised Embedded Clustering (DSEC)

While autoencoders are an effective approach for learning a lower dimensional embedding, a key challenge of unsupervised approaches to heterogeneous data is that, while the embedding may be representative of the inputs, the representation might not accurately reflect the features of interest for patient or disease stratification. This can be partially resolved by applying a supervised model to the embedding (Beaulieu-Jones and Greene, 2016) or by transfer learning of a pre-trained network (Han et al., 2019). We propose modifying the latent representation based on known patient subgroups through transfer learning of the encoder and fine-tuning of layers. This adapts the embedding to the problem of interest.

We propose to train the DSEC model in three sequential steps: training an autoencoder, updating the weights of the encoder with a classification task, and updating all the layers of the encoder with a clustering loss.

In our approach, a de-noising autoencoder, which partially corrupts the input before reconstructing the original data (Vincent et al., 2010), is used to determine the embedded space. The input data, $x$, is corrupted to $\tilde{x}$ through a stochastic mapping implemented as a Gaussian noise layer at the input of the de-noising autoencoder. The reconstruction error is measured by the mean absolute error (MAE) loss.
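The corruption and reconstruction loss described above can be sketched as follows. The noise standard deviation, the toy data, and the function names are assumptions made for illustration; the text does not state the noise level used.

```python
import numpy as np

def corrupt(x, noise_std, rng):
    """Additive Gaussian corruption of the input (noise_std is a hypothetical setting)."""
    return x + rng.normal(scale=noise_std, size=x.shape)

def mae(x, x_hat):
    """Mean absolute error between the clean input and a reconstruction."""
    return float(np.mean(np.abs(x - x_hat)))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 13))   # 4 hypothetical admissions with 13 standardised features
x_tilde = corrupt(x, noise_std=0.1, rng=rng)

# With no autoencoder in the loop, the corrupted input itself acts as a trivial
# "reconstruction"; a trained de-noising autoencoder should achieve a lower error.
baseline_error = mae(x, x_tilde)
```

Note that the loss is computed against the clean input, so the network is rewarded for removing the noise rather than copying it.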

The non-linear mapping, or encoder, is represented by two fully connected dense layers, the first with 1000 nodes and the second with 500 nodes, followed by an embedded space of three nodes. The number of nodes was chosen by adapting the DEC architecture (Xie et al., 2016). Each layer uses ReLU activation functions. The Adam optimizer (learning rate , , ) is used to optimise the weights of the network (Kingma and Ba, 2015).
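To make the layer sizes concrete, here is a sketch of the encoder's forward pass for 13 input features. The weight initialisation is an assumption, and so is the decision to leave the three-node embedding linear, since the text does not make clear whether ReLU is applied to the embedding itself.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_layer(n_in, n_out, rng):
    # Small random weights and zero biases; the initialisation scheme is an assumption.
    return rng.normal(scale=0.01, size=(n_in, n_out)), np.zeros(n_out)

def encode(x, layers):
    """Forward pass: ReLU dense layers followed by a linear embedding layer."""
    h = x
    for w, b in layers[:-1]:
        h = relu(h @ w + b)
    w, b = layers[-1]
    return h @ w + b

rng = np.random.default_rng(0)
sizes = [13, 1000, 500, 3]   # input features, two dense layers, embedded space
layers = [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]
z = encode(rng.normal(size=(8, 13)), layers)          # 8 hypothetical admissions
n_params = sum(w.size + b.size for w, b in layers)    # 516,003 trainable parameters
```

The parameter count is dominated by the 1000-to-500 layer, which is large relative to a cohort of a few thousand patients and is one reason validation-based stopping matters here.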

Once the de-noising autoencoder has been pre-trained, the decoder is removed and a fully connected classification layer with softmax activation is added to the encoder. The de-noising corruption is removed from the architecture and the weights of the first dense layer are fixed. Known labels for each observation are then used in a classification task: transfer learning updates the weights of the final dense layer and the embedding layer to minimize the binary cross-entropy loss function. This results in an updated mapping of the input to a new embedded space.
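The transfer learning step can be illustrated with a hand-written gradient step in which the first dense layer stays frozen. This is only a sketch under several assumptions: toy layer widths, plain gradient descent in place of Adam, a linear embedding, synthetic labels, and a two-class softmax head (for which cross-entropy coincides with binary cross-entropy).

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def fine_tune_step(x, y_onehot, params, lr=0.01):
    """One gradient-descent step on the cross-entropy loss; W1 and b1 stay frozen."""
    W1, b1, W2, b2, W3, b3, W4, b4 = params
    # Forward pass.
    h1 = relu(x @ W1 + b1)     # frozen first dense layer
    h2 = relu(h1 @ W2 + b2)    # final dense layer (trainable)
    z = h2 @ W3 + b3           # embedding layer (trainable, linear in this sketch)
    p = softmax(z @ W4 + b4)   # classification head
    loss = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    # Backward pass, stopping before the frozen layer.
    d_logits = (p - y_onehot) / x.shape[0]
    gW4, gb4 = z.T @ d_logits, d_logits.sum(axis=0)
    d_z = d_logits @ W4.T
    gW3, gb3 = h2.T @ d_z, d_z.sum(axis=0)
    d_h2 = (d_z @ W3.T) * (h2 > 0)
    gW2, gb2 = h1.T @ d_h2, d_h2.sum(axis=0)
    updated = [W1, b1,
               W2 - lr * gW2, b2 - lr * gb2,
               W3 - lr * gW3, b3 - lr * gb3,
               W4 - lr * gW4, b4 - lr * gb4]
    return updated, float(loss)

rng = np.random.default_rng(0)
sizes = [13, 16, 8, 3, 2]   # toy widths; the paper uses 1000 and 500 dense nodes
params = []
for n_in, n_out in zip(sizes[:-1], sizes[1:]):
    params += [rng.normal(scale=0.3, size=(n_in, n_out)), np.zeros(n_out)]
x = rng.normal(size=(32, 13))
y_onehot = np.eye(2)[(x[:, 0] > 0).astype(int)]   # synthetic binary labels
W1_before = params[0].copy()
losses = []
for _ in range(20):
    params, loss = fine_tune_step(x, y_onehot, params)
    losses.append(loss)
```

After these steps the first layer's weights are unchanged, while the embedding has been pulled towards directions that separate the two label groups.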

A clustering loss (Kullback-Leibler divergence) is then used to further update the entire encoder and latent space, updating the non-linear mapping once more. The optimization of the encoder and cluster centers is performed as in Xie et al. (2016). The cluster centers, $\mu_j$, and the encoder parameters, $\theta$, are jointly optimized using the Adam optimizer (learning rate , , ).

The number of epochs used for training was determined from the loss on the validation set in order to avoid overfitting to the training set. For DEC, the autoencoder was trained for 50 epochs and the clustering step for 200 epochs. For DSEC, the autoencoder was trained for 50 epochs, the semi-supervised transfer learning for 10 epochs, and the clustering for 200 epochs. The models were then trained on the entire training set before being applied to the test set.

2.3 Analysis of the Embedded Space

Figure 1: Hierarchical clustering of vital signs and laboratory measures shown as a PCA projection of the three-dimensional embedded space from DSEC.

Figure 2: Receiver operating characteristic curves for classification of heart failure and control patients. Curves are shown for PCA and random forest, DEC and random forest, and DSEC.

After a low-dimensional patient representation is created using DSEC, we define the patient subgroups using standard clustering approaches on the embedded space. Agglomerative clustering is a type of hierarchical clustering method in which all observations initially start as individual clusters before pairs of clusters are successively merged into a new cluster (Rokach and Maimon, 2005). We use the Ward criterion as the linkage function (Ward and Hook, 1963).
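The agglomerative procedure with the Ward criterion can be sketched naively as below. This is an illustrative implementation on toy three-dimensional points, not the code used in the study, and it recomputes all pairwise merge costs at every step, so it is only suitable for small inputs.

```python
import numpy as np

def ward_agglomerate(points, n_clusters):
    """Naive Ward agglomeration; returns a list of index lists, one per cluster."""
    clusters = [[i] for i in range(len(points))]   # every observation starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = points[clusters[a]].mean(axis=0)
                cb = points[clusters[b]].mean(axis=0)
                na, nb = len(clusters[a]), len(clusters[b])
                # Ward criterion: increase in within-cluster sum of squares on merging.
                cost = na * nb / (na + nb) * np.sum((ca - cb) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the cheapest pair
        del clusters[b]
    return clusters

points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                   [5.0, 5.0, 5.0], [5.1, 5.0, 5.0], [5.0, 5.1, 5.0]])
groups = ward_agglomerate(points, n_clusters=2)
```

Recording the sequence of merges, rather than only the final partition, gives the hierarchy in which each split can be examined. In practice a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would normally be preferred.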

Once the subgroups are found, we compare the dominant ICD-10 diagnosis codes in each subgroup using enrichment analysis. Enrichment analysis is performed using the Fisher exact test (Fisher, 1922) for pairwise comparisons between ICD-10 codes within clusters. For each ICD-10 code the log odds ratio and corresponding p-value are found. The p-values are corrected for multiple testing using Bonferroni correction, with statistically significant odds ratios indicating that the ICD-10 code is enriched in one of the clusters. For hierarchical clustering, enrichment is performed pairwise between the two clusters which are combined in each of the agglomeration steps.
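As a sketch of the enrichment computation for a single ICD-10 code, consider a 2x2 table whose rows are the two clusters being compared and whose columns count patients with and without the code. The counts below are made up, the natural logarithm is an assumption, and no correction is applied for zero cells.

```python
from math import comb, log

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))

def log_odds_ratio(a, b, c, d):
    """Log odds ratio of the 2x2 table; undefined if b or c is zero."""
    return log((a * d) / (b * c))

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical counts: cluster A has 3 patients with the code and 1 without,
# cluster B has 1 with and 3 without.
p_value = fisher_exact_two_sided(3, 1, 1, 3)   # 34/70, about 0.486
lor = log_odds_ratio(3, 1, 1, 3)               # log(9), about 2.197
adjusted = bonferroni([0.01, 0.5, 0.04])
```

A positive log odds ratio indicates that the code is more common in the first cluster of the pair, and the Bonferroni step scales each p-value by the number of codes tested.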

3 Data

De-identified patient health record data was obtained from a large UK general trust hospital. These patients underwent digital monitoring of bedside vital sign measurements (Wong et al., 2017). From this dataset, patients with a primary diagnosis of heart failure (ICD-10 code I50*) were selected. Each patient has a number of admissions that may occur before, on, or after a heart failure diagnosis. This resulted in 2,791 patients with 27,143 admissions. We used propensity matching (based on logistic regression and nearest neighbor matching (Ho et al., 2007)) on age and sex to derive a control cohort of patients with any admission other than I50*. This resulted in a total of 5,498 patients with 39,908 admissions. Within each admission there may be multiple measurements; in this analysis we take the mean of each measurement within an admission. For heart failure patients, the admission with the first heart failure diagnosis is selected; for the control cohort, the admission with the fewest missing values is selected.
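The nearest neighbour stage of propensity matching can be sketched as follows, assuming propensity scores have already been estimated (for example by logistic regression on age and sex). Greedy 1:1 matching without replacement is an assumption made here for illustration, and the scores are invented.

```python
def nearest_neighbour_match(case_scores, control_scores):
    """Greedy 1:1 matching without replacement on propensity score."""
    available = dict(enumerate(control_scores))   # control index -> score
    matches = {}
    for i, score in enumerate(case_scores):
        if not available:
            break
        # Pick the unused control whose score is closest to this case.
        j = min(available, key=lambda k: abs(available[k] - score))
        matches[i] = j
        del available[j]
    return matches

case_scores = [0.80, 0.55, 0.30]                  # hypothetical heart failure patients
control_scores = [0.10, 0.33, 0.52, 0.79, 0.95]   # hypothetical candidate controls
matches = nearest_neighbour_match(case_scores, control_scores)
# matches maps case index to control index: {0: 3, 1: 2, 2: 1}
```

Because the matching is greedy, the result can depend on the order in which cases are processed; optimal matching or a caliper on the score difference are common refinements.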

3.1 Data Pre-processing

To impute missing values we use Bidirectional Recurrent Imputation for Time Series (BRITS) (Cao et al., 2018), which combines an imputation loss with the loss of a prediction or classification task and which we found to be most effective on previous cohorts. Each admission contains laboratory work and vital sign measurements. A vital sign or laboratory measure was selected if it was present in more than 60% of cases. Features used in this work include systolic blood pressure, diastolic blood pressure, heart rate, oxygen saturation (SpO2), temperature, alanine aminotransferase (ALT), creatinine, C-reactive protein (CRP), platelets, potassium, sodium, urea, and white blood cells.

Cases were then excluded if fewer than 60% of these features were present for a particular admission. This reduced the number of patients to 4,497 (2,298 heart failure and 2,199 control). A test set of 25% was removed from the dataset in a stratified way and was held out of all training procedures. The remaining 75% of the data was used in 5-fold cross validation in order to optimize the network parameters and ensure the model was not overfitting.
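The two presence thresholds can be sketched on a small array in which missing values are stored as NaN. The array, the function name, and the NaN encoding are assumptions made for illustration.

```python
import numpy as np

def filter_by_presence(data, threshold=0.6):
    """Keep features present in more than `threshold` of rows, then keep rows
    with at least `threshold` of the remaining features present."""
    present = ~np.isnan(data)
    keep_cols = present.mean(axis=0) > threshold
    kept = data[:, keep_cols]
    keep_rows = (~np.isnan(kept)).mean(axis=1) >= threshold
    return kept[keep_rows], keep_cols, keep_rows

nan = np.nan
data = np.array([          # rows: hypothetical admissions, columns: measures
    [1.0, 2.0, nan, 4.0],
    [1.0, 2.0, nan, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [nan, nan, nan, 4.0],
    [1.0, 2.0, nan, 4.0],
])
filtered, keep_cols, keep_rows = filter_by_presence(data)
# The third measure (present in 1 of 5 rows) and the fourth admission
# (1 of 3 kept measures present) are dropped, leaving a 4 x 3 array.
```

Applying the feature filter first means the row filter is evaluated only against measures that are actually used downstream.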

4 Results and Discussion

The accuracy of distinguishing heart failure and non-heart failure cases using the three approaches (PCA, DEC, and DSEC) was determined through the area under the ROC curves, as shown in Figure 2. For PCA and DEC, ROC curves are obtained by training a random forest on the embedded space, whereas for DSEC the ROC curve is obtained from the semi-supervised classification step. DSEC obtains an area under the ROC curve of 0.84, considerably outperforming PCA (0.66) and DEC (0.73).
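For reference, the area under the ROC curve can be computed directly from predicted scores as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The labels and scores below are invented for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]               # 1 = heart failure, 0 = control (hypothetical)
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]   # hypothetical classifier outputs
auc = roc_auc(labels, scores)             # 8 of 9 pairs ranked correctly, about 0.889
```

An AUC of 0.5 corresponds to random ranking, which gives a scale for reading the reported values of 0.66, 0.73, and 0.84.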

Figure 1 shows a hierarchical clustering of the learnt DSEC embedding, where we iteratively combine the groups into larger subgroups in order to investigate whether different comorbidities exist within different spaces of the embedding.

Group    Enriched ICD-10 codes (log odds ratio)
1        E78.0 Pure hypercholesterolaemia (0.96)
2        I50.0 Congestive heart failure (1.69)
         E87.7 Fluid overload (1.39)
         I50.9 Heart failure (1.32)
         N18.9 Chronic renal failure (1.22)
         I34.0 Mitral (valve) insufficiency (1.17)
2.1      I42.0 Dilated cardiomyopathy (1.56)
         N17.9 Acute renal failure (1.37)
         N39.0 Urinary tract infection (1.34)
         Z95.1 Aortocoronary bypass graft (1.10)
         N18.9 Chronic renal failure (1.04)
2.1.1    I50.1 Left ventricular failure (2.35)
2.1.2    R18 Ascites (3.00)
         E87.5 Hyperkalaemia (2.13)
         I42.0 Dilated cardiomyopathy (1.55)
         I50.0 Congestive heart failure (1.16)
         N17.9 Acute renal failure (1.15)
2.2      E87.5 Hyperkalaemia (3.26)
         Z51.5 Palliative care (2.69)
2.2.1    I27.2 Secondary pulmonary hypertension (2.91)
         N18.9 Chronic renal failure (1.45)
         I48.9 Atrial fibrillation and flutter (1.36)
         E87.7 Fluid overload (1.09)
         Z92.1 Use of anticoagulants (0.95)
Table 1: Enriched ICD-10 codes between hierarchical splits in the clustering (log odds ratio shown in brackets). Enrichment is performed between pairs of clusters (for example 1 vs 2, 2.1 vs 2.2, and 1.1.1 vs 1.1.2). Sub-hierarchies of cluster 1 (control) are not shown.

Table 1 shows cluster enrichment for the hierarchical splits. The first split in the hierarchical clustering is between heart failure (group 2) and controls (group 1). Subgroups of heart failure can be identified, including a subgroup associated with dilated cardiomyopathy, renal failure, and aortocoronary bypass grafts (group 2.1). Group 2.1 can be further split into subgroups associated with left ventricular failure (2.1.1) and with ascites and hyperkalaemia (2.1.2). Group 2.2 can also be divided to give a subgroup of patients associated with secondary pulmonary hypertension, atrial fibrillation and flutter, and a history of anticoagulant use (2.2.1). This demonstrates that the method is capable of determining clinically relevant (Maisel and Stevenson, 2003; Dickhout et al., 2011) subgroups from vital signs and laboratory measures. Further analysis of the subgroups, and a comparison between PCA, DEC, and DSEC showing the superior performance of DSEC, are given in the supplementary materials.

While this is a powerful extension of the standard autoencoder embedding, which has been previously applied, current limitations are that we only consider a single admission per patient and that the method cannot deal with missing values, which are a common problem in EHRs.

We have shown that our method outperforms other clustering algorithms in determining subgroups within heart failure patients. We aim to extend this approach to handle multiple admissions and to develop imputation-free methods of embedding to further improve phenotyping of heart failure and other diseases from EHRs. This has the potential to allow for adjusted treatments of the different patient subgroups.

5 Conclusions

In this paper we demonstrate the application of DSEC to features derived from EHRs. We show our approaches can distinguish heart failure and non-heart failure cases based on laboratory measurements and vital signs. We illustrate that optimizing the embedding on known subgroups allows us to learn a more powerful representation and that subgroups within the heart failure cohort show enrichment of certain co-morbidities (ICD-10 codes).


This work uses data provided by patients and collected by the NHS as part of their care and support. We believe using patient data is vital to improve health and care for everyone and would, thus, like to thank all those involved for their contribution. The data were extracted, anonymised, and supplied by the Trust in accordance with internal information governance review, NHS Trust information governance approval, and General Data Protection Regulation (GDPR) procedures outlined under the Strategic Research Agreement (SRA) and relative Data Sharing Agreements (DSAs) signed by the Trust and Sensyne Health plc.

This research has been conducted using the Oxford University Hospitals NHS Foundation Trust Clinical Data Warehouse, which is supported by the NIHR Oxford Biomedical Research Centre and Oxford University Hospitals NHS Foundation Trust. Special thanks to Kerrie Woods, Kinga Varnai, Oliver Freeman, Hizni Salih, Zuzana Moysova, Professor Jim Davies and Steve Harris.


  • B. K. Beaulieu-Jones and C. S. Greene (2016) Semi-Supervised Learning of the Electronic Health Record for Phenotype Stratification. J Biomed Inform 64, pp. 168–178. Cited by: §1, §1, §2.2.
  • B. K. Beaulieu-Jones, D. R. Lavage, J. W. Snyder, J. H. Moore, S. A. Pendergrass, and C. R. Bauer (2018) Characterizing and Managing Missing Structured Data in Electronic Health Records: Data Analysis. JMIR Medical Informatics 6 (1), pp. e11. Cited by: §1.
  • W. Cao, H. Zhou, D. Wang, Y. Li, J. Li, and L. Li (2018) BRITS: Bidirectional recurrent imputation for time series. Advances in Neural Information Processing Systems (NeurIPS), pp. 6775–6785. Cited by: §3.1.
  • E. Choi, A. Schuetz, W. F. Stewart, and J. Sun (2016) Medical concept representation learning from electronic health records and its application on heart failure prediction. External Links: Link, 1602.03686 Cited by: §1.
  • E. Choi, Z. Xu, Y. Li, M. W. Dusenberry, G. Flores, Y. Xue, and A. M. Dai (2019) Graph Convolutional Transformer: Learning the Graphical Structure of Electronic Health Records. pp. 1–17. External Links: Link Cited by: §1.
  • S. Denaxas, P. Stenetorp, S. Riedel, M. Pikoula, R. Dobson, and H. Hemingway (2018) Application of Clinical Concept Embeddings for Heart Failure Prediction in UK EHR data. External Links: Link Cited by: §1.
  • J. G. Dickhout, R. E. Carlisle, and R. C. Austin (2011) Interrelationship between cardiac hypertrophy, heart failure, and chronic kidney disease: Endoplasmic reticulum stress as a mediator of pathogenesis. Circulation Research 108 (5), pp. 629–642. Cited by: §4.
  • A. R. T. Donders, G. J.M.G. van der Heijden, T. Stijnen, and K. G.M. Moons (2006) Review: A gentle introduction to imputation of missing values. Journal of Clinical Epidemiology 59 (10), pp. 1087–1091. Cited by: §1.
  • J. Enguehard, P. O’Halloran, and A. Gholipour (2019) Semi-Supervised Learning With Deep Embedded Clustering for Image Classification and Segmentation. IEEE Access 7 (1), pp. 11093–11104. Cited by: §1.
  • R. A. Fisher (1922) On the Interpretation of χ² from Contingency Tables, and the Calculation of P. Journal of the Royal Statistical Society 85 (1), pp. 87. Cited by: §2.3.
  • K. Han, A. Vedaldi, and A. Zisserman (2019) Learning to Discover Novel Visual Categories via Deep Transfer Clustering. External Links: 1908.09884, Link Cited by: §2.2.
  • D. E. Ho, K. Imai, G. King, and E. A. Stuart (2007) Matching as nonparametric preprocessing for reducing model dependence in parametric causal inference. Political Analysis 15 (3), pp. 199–236. Cited by: §3.
  • A. Inamdar and A. Inamdar (2016) Heart Failure: Diagnosis, Management and Utilization. Journal of Clinical Medicine 5 (7), pp. 62. Cited by: §1.
  • D. P. Kingma and J. L. Ba (2015) Adam: A method for stochastic optimization. 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, pp. 1–15. External Links: 1412.6980 Cited by: §2.2.
  • T. A. Lasko, J. C. Denny, and M. A. Levy (2013) Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data. PLoS One 8 (6), pp. e66341. Cited by: §1.
  • Y. Li, S. Rao, J. R. A. Solares, A. Hassaine, D. Canoy, Y. Zhu, K. Rahimi, and G. Salimi-Khorshidi (2019) BEHRT: Transformer for Electronic Health Records. External Links: Link Cited by: §1.
  • L. Liu, H. Li, Z. Hu, H. Shi, Z. Wang, J. Tang, and M. Zhang (2019) Learning Hierarchical Representations of Electronic Health Records for Clinical Outcome Prediction. External Links: Link Cited by: §1.
  • W. H. Maisel and L. W. Stevenson (2003) Atrial fibrillation in heart failure: Epidemiology, pathophysiology, and rationale for therapy. American Journal of Cardiology 91 (6), pp. 2–8. Cited by: §4.
  • R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley (2016) Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific Reports 6, pp. 26094. Cited by: §1, §1.
  • V. Nair and G. Hinton (2010) Rectified Linear Units Improve Restricted Boltzmann Machines. International Conference on Machine Learning. Cited by: §2.1.
  • Y. Ren, K. Hu, X. Dai, L. Pan, S. C.H. Hoi, and Z. Xu (2019) Semi-supervised deep embedded clustering. Neurocomputing 325, pp. 121–130. Cited by: §1.
  • L. Rokach and O. Maimon (2005) Clustering Methods. In Data Mining and Knowledge Discovery Handbook, O. Maimon and L. Rokach (Eds.), pp. 321–352. Cited by: §2.3.
  • B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi (2018) Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE J Biomed Health Inform 22 (5), pp. 1589–1604. Cited by: §1.
  • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol (2010) Stacked denoising autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion. Journal of Machine Learning Research 11, pp. 3371–3408. Cited by: §2.2.
  • J. H. Ward and M. E. Hook (1963) Application of an Hierarchical Grouping Procedure to a Problem of Grouping Profiles. Educational and Psychological Measurement 23 (1), pp. 69–81. Cited by: §2.3.
  • X. Wei and C. Eickhoff (2018) Embedding Electronic Health Records for Clinical Information Retrieval. External Links: 1811.05402, Link Cited by: §1.
  • S. Wold, K. Esbensen, and P. Geladi (1987) Principal component analysis. Chemometrics and intelligent laboratory systems 2 (1-3), pp. 37–52. Cited by: §1.
  • D. Wong, N. Wu, and P. Watkinson (2017) Quantitative metrics for evaluating the phased roll-out of clinical information systems. International Journal of Medical Informatics 105, pp. 130–135. Cited by: §3.
  • J. Xie, R. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. 33rd International Conference on Machine Learning, ICML 2016 1, pp. 740–749. External Links: Link Cited by: §2.1, §2.1, §2.2, §2.2.
  • J. Zhang, K. Kowsari, J. H. Harrison, J. M. Lobo, and L. E. Barnes (2018) Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record. IEEE Access 6, pp. 65333–65346. Cited by: §1.
  • Z. Zhu, C. Yin, B. Qian, Y. Cheng, J. Wei, and F. Wang (2016) Measuring patient similarities via a deep architecture with medical concept embedding. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pp. 749–758. Cited by: §1.