Deep Representation Learning of Electronic Health Records to Unlock Patient Stratification at Scale

Objective: Deriving disease subtypes from electronic health records (EHRs) can guide next-generation personalized medicine. However, challenges in summarizing and representing patient data prevent widespread practice of scalable EHR-based stratification analysis. Here, we present a novel unsupervised framework based on deep learning to process heterogeneous EHRs and derive patient representations that can efficiently and effectively enable patient stratification at scale. Materials and methods: We considered EHRs of 1,608,741 patients from a diverse hospital cohort, comprising a total of 57,464 clinical concepts. We introduce a representation learning model based on word embeddings, convolutional neural networks, and autoencoders (i.e., "ConvAE") to transform patient trajectories into low-dimensional latent vectors. We evaluated these representations as broadly enabling patient stratification by applying hierarchical clustering to different multi-disease and disease-specific patient cohorts. Results: ConvAE significantly outperformed several common baselines in a clustering task to identify patients with different complex conditions, with 2.61 entropy and 0.31 purity average scores. When applied to stratify patients within a certain condition, ConvAE led to various clinically relevant subtypes for different disorders, including type 2 diabetes, Parkinson's disease, and Alzheimer's disease, largely related to comorbidities, disease progression, and symptom severity. Conclusions: Patient representations derived from modeling EHRs with ConvAE can help develop personalized medicine therapeutic strategies and better understand varying etiologies in heterogeneous sub-populations.






Electronic health records (EHRs) are collected as part of routine care across the vast majority of healthcare institutions. They consist of heterogeneous structured and unstructured data elements, including demographic information, diagnoses, laboratory results, medication prescriptions, free-text clinical notes, and images. EHRs provide snapshots of a patient’s state of health and have created unprecedented opportunities to investigate the properties of clinical events across large populations using data-driven approaches and machine learning. At the individual level, patient trajectories can foster personalized medicine; across a population, EHRs can provide a vital resource to understand population health management and help make better decisions for healthcare operation policies.


Personalized medicine focuses on the use of patient-specific data to tailor treatment to an individual’s unique health characteristics. However, even seemingly simple diseases can show different degrees of complexity that can create challenges for identification, treatment, and prognosis, despite equivalence at the diagnostic level [cutting2015, alexandrov2016]. Heterogeneity among patients is particularly evident for complex disorders, where the etiology is due to an amalgamation of multiple genetic, environmental, and lifestyle factors. Several different conditions have been referred to as complex, such as Parkinson’s disease (PD) [langston2006], multiple myeloma (MM) [mel2014], and type 2 diabetes (T2D) [pearson2019]. Patients with complex disorders may differ on multiple systemic layers (e.g., different clinical measurements or comorbidity landscape) and in response to treatments, making these conditions difficult to model. Multiple data types in patient longitudinal EHR histories offer a way to examine disease complexity and present an opportunity to refine diseases into subtypes and tailor personalized treatments. This task is usually referred to as “EHR-based patient stratification”. This follows a common approach in clinical research, where attempts to identify latent patterns within a cohort of patients can contribute to the development of improved personalized therapies [dugger2017].

From a computational perspective, patient stratification is a data-driven, unsupervised learning task that groups patients according to their clinical characteristics [baytas2017patient]. Previous work in this domain aggregates clinical data at the patient level, representing each patient as a multi-dimensional vector, and derives subtypes within a disease-specific population via clustering (e.g., in autism [doshivelez2013]) or topological analysis (e.g., for T2D [li2015identification]). Deep learning has been applied to derive more robust patient representations to improve disease subtyping [baytas2017patient, zhang2019data]. Baytas et al. used time-aware Long Short-Term Memory (LSTM) networks to leverage stratification of longitudinal data of PD patients [baytas2017patient]. Similarly, Zhang et al. used LSTMs to identify three subgroups of patients with idiopathic PD that differ in disease progression patterns and symptom severity [zhang2019data]. These studies, however, focused only on small, curated disease-specific cohorts with ad hoc manually selected features. This approach not only limits scalability and generalizability, but also hinders the discovery of unknown patterns that might characterize a condition. Because EHRs tend to be incomplete, using a diverse cohort of patients to derive disease-specific subgroups can better capture the features of heterogeneity within the disease of interest [chen2019deep]. However, it is challenging to build large-scale computational models from EHRs because of data quality issues, such as high dimensionality, heterogeneity, sparseness, random errors, and systematic biases [miotto17review, xiao2018review].

This paper proposes a general framework for identifying disease subtypes at scale (see Figure 1b). We first propose an unsupervised deep learning architecture to derive vector-based patient representations from a large and domain-free collection of EHRs. This model (i.e., ConvAE) combines 1) embeddings to contextualize medical concepts, 2) convolutional neural networks (CNNs) to loosely model the temporal aspects of patient data, and 3) autoencoders (AEs) to enable an unsupervised architecture. Second, we show that ConvAE-based representations learned from the real-world EHRs of about 1.6 million patients from the Mount Sinai Health System in New York improve clustering of patients with different disorders compared to several commonly used baselines. Last, we demonstrate that ConvAE leads to effective patient stratification with minimal effort. To this end, we used the encodings learned from domain-free and heterogeneous EHRs to derive subtypes for different complex disorders and provide a qualitative analysis to determine their clinical relevance.

To the best of our knowledge, this is the first architecture that enables patient stratification at scale by eliminating the need for manual feature engineering and explicit labeling of events within patient care timelines, and that processes the whole EHR sequence regardless of the length of patient history. By generating disease subgroups from large-scale EHR data, this architecture can help disentangle clinical heterogeneity and identify high-impact patterns within complex disorders, whose effect may be masked in case-control studies [manchia2013]. The specific properties of the different subgroups can then potentially inform personalized treatments and improve patient care.


We first evaluated the extent to which ConvAE-based patient representations can be used to identify different clinical diagnoses in the EHRs (i.e., disease phenotyping [shah18]). To this end, we performed a clustering analysis using patients with the following eight complex disorders: T2D, MM, PD, Alzheimer’s disease (AD), Crohn’s disease (CD), breast cancer (BC), prostate cancer (PC), and attention deficit hyperactivity disorder (ADHD). We used SNOMED [cote1980] to find all patients in the data warehouse diagnosed with these conditions; see Supplementary Table LABEL:tab:disclass and the “Multi-Disease Clustering Analysis” subsection in “Methods” for more details.

Table 1 shows the results using hierarchical clustering for different ConvAE architectures (one, two, and multikernel CNN layers) and baselines. Entropy and purity scores are averaged over a 2-fold cross-validation experiment, where we used independent cohorts of random patients to train the models and separate patients for testing. These were derived by retaining only the patients in the whole test sets who were diagnosed with one of the eight disorders under consideration (see the “Dataset” subsection in “Methods” for more details). ConvAE performed significantly better than other models largely used in healthcare for representation learning, including Deep Patient [miotto2016deep], for both entropy and purity scores (t-test comparisons with Bonferroni correction). The configuration with one CNN layer obtained the best overall performance and was also able to associate clusters to the largest number of distinct diseases (based on purity score analysis).
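The entropy and purity scores reported above can be computed from cluster assignments and disease labels. Below is a minimal sketch using the standard size-weighted definitions; the paper's exact weighting and implementation may differ.

```python
import math
from collections import Counter

def cluster_entropy_purity(clusters, labels):
    """Average entropy and purity over clusters.

    clusters: list of cluster ids, one per patient
    labels:   list of true disease labels, one per patient
    Entropy is low / purity is high when each cluster is
    dominated by a single disease.
    """
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    n = len(labels)
    entropy = purity = 0.0
    for members in by_cluster.values():
        counts = Counter(members)
        size = len(members)
        # Shannon entropy of the label distribution within the cluster
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        entropy += (size / n) * h
        # fraction of the cluster occupied by its dominant disease
        purity += (size / n) * (counts.most_common(1)[0][1] / size)
    return entropy, purity
```

A perfect clustering (each cluster holding a single disease) yields entropy 0 and purity 1; a fully mixed clustering drives entropy up and purity down.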

Figure 2 visualizes the distribution of the different patient representations along with their disease cohort labels obtained using UMAP (Uniform Manifold Approximation and Projection for dimension reduction [mcinnes2018umap]). ConvAE captures hidden patterns of overlapping phenotypes while still displaying identifiable groups of patients with distinct disorders. Figure 3 shows the same patient distribution highlighting clustering labels and the purity percentage scores of each cluster’s dominating disease. These figures refer to only one of the cross-validation splits; results for the second split are similar and are available in Supplementary Figures LABEL:fig:encodings2 and LABEL:fig:clustering2. ConvAE (with one CNN layer) also led to visually better clustering than all baselines. Patients with ADHD were the most separated and were detected with high purity by hierarchical clustering. Visible high-purity clusters were also identified for T2D, PC, and PD. Comparing the encoding projections (Figure 2) to the clustering visualization (Figure 3), we observe that patients whose disease is not correctly identified by the clusters tend not to separate clearly in this low-dimensional space. As an example, AD patients were randomly scattered in the plot and did not lead to distinguishable clusters. This might be due to factors such as sex and age, or noise, but it might also reflect a shared phenotypic characterization that drives the learning process into placing these patient EHR progressions close together irrespective of disease labels.

We then evaluated the use of ConvAE representations for patient stratification at scale and the identification of clinically relevant disease subtypes. We considered six diseases: T2D, PD, AD, MM, PC, and BC. These are all age-related complex disorders with late onset (i.e., increased average prevalence later in life) [cowie2018, lau2006, qiu2009, kazandjian2016, pc1, bc2]. We decided to focus on these conditions to avoid, to some extent, the confounding effect of age that could affect learning and the evaluation of different subtypes. Figure 4 shows the results of running hierarchical clustering on the ConvAE-based patient representations of each disease cohort. To determine the optimal number of clusters, we empirically selected the smallest number of clusters beyond which the gain in explained variance leveled off (i.e., the elbow method). We were able to identify different subtypes for each disease with no additional feature selection, using representations derived from a domain-free cohort of patients. Supplementary Table LABEL:tab:inner reports the number of patients in each cohort and the number of subgroups identified. Similar results were obtained for the second split and are reported in Supplementary Figure LABEL:fig:innerval2.
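The elbow rule described above can be sketched as a simple stopping criterion over within-cluster sums of squares (WCSS) for increasing numbers of clusters; the threshold `tol` is a hypothetical choice for illustration, not a value from the paper.

```python
def elbow_k(wcss, tol=0.1):
    """Pick the smallest number of clusters k such that adding one
    more cluster improves the fit by less than `tol` (relative
    decrease in within-cluster sum of squares).

    wcss: list of WCSS values for k = 1, 2, ..., len(wcss) clusters
    Returns the chosen k (1-indexed).
    """
    for k in range(1, len(wcss)):
        # relative improvement from k to k + 1 clusters
        gain = (wcss[k - 1] - wcss[k]) / wcss[k - 1]
        if gain < tol:
            return k  # the extra cluster adds little; keep k
    return len(wcss)
```

For a WCSS curve that drops steeply and then flattens, e.g., `[100, 40, 20, 19, 18.5]`, this rule selects the point where the curve bends.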

In the following sections, we present the clinical characterization of the T2D, PD, and AD subgroups via enrichment analysis of medical concept occurrences (see Supplementary Material for the characterization of the other conditions). We compare the T2D and PD results to related studies based on ad hoc cohorts [li2015identification, zhang2019data]. Conversely, there are no published EHR-based stratification studies for AD, MM, PC, and BC to use for comparison. All subtypes were reviewed by a clinical expert to highlight meaningful descriptors, and between-group differences were assessed via multiple pairwise chi-squared tests and t-tests. For each disease, we list sex and age statistics of the cohort, as well as the five most frequent diagnoses, medications, laboratory tests, and procedures, ordered according to in-group and total frequencies, in Supplementary Tables LABEL:tab:t2d-LABEL:tab:bc. The results for the second split are reported in Supplementary Tables LABEL:tab:t2d2-LABEL:tab:bc2.
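A pairwise chi-squared comparison of a concept's frequency between two subgroups can be sketched with the closed-form statistic for a 2x2 contingency table. The 3.841 cutoff corresponds to p < 0.05 at one degree of freedom; the Bonferroni adjustment applied in the paper is omitted here for brevity.

```python
def chi2_2x2(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table:
    rows = subgroup I / subgroup II, cols = concept present / absent."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

def enriched(count1, n1, count2, n2, critical=3.841):
    """Is a concept's frequency significantly different between two
    subgroups? `critical` is the chi-squared cutoff for p < 0.05, df = 1."""
    stat = chi2_2x2(count1, n1 - count1, count2, n2 - count2)
    return stat > critical
```

For example, a concept occurring in 80 of 100 patients in one subgroup versus 20 of 100 in another is flagged, while identical frequencies are not.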

Type 2 diabetes

Patients with T2D clustered into three subgroups that relate to different stages of disease progression (see Figure 4a and Supplementary Table LABEL:tab:t2d for details).

Subgroup I represents the mild symptom severity cohort, characterized by common T2D symptoms (e.g., metabolic syndrome) that were treated with Metformin, an oral hypoglycemic medication. It also included patients exposed to lifestyle risk factors, such as Obesity [pearson2019].

Subgroups II and III showed concomitant conditions associated with T2D progression and worsening symptoms. Specifically, subgroup II clustered patients characterized by microvascular problems, such as diabetic nephropathy, neuropathy, and/or peripheral artery disease. The significant presence of Creatinine and Urea nitrogen laboratory tests, which estimate renal function, suggests monitoring for kidney diseases, which are often related to T2D [vallon2011]. The presence of Pain in limb, combined with analgesic drugs (i.e., Paracetamol, Oxycodone), indicates vascular lesions at the peripheral level, manifested as ischemic rest pain or ulceration. This was confirmed by Peripheral vascular disease diagnoses in the T2D cohort.

Subgroup III showed severe cardiovascular problems, identified by a significant presence of medical concepts related to coronary artery disease, e.g., Coronary atherosclerosis and Angina pectoris, which are serious risk factors for heart failure. These subjects were often treated with antiplatelet therapy (i.e., Acetylsalicylic acid, Clopidogrel) to prevent cardiovascular events (e.g., stroke) and were likely to receive invasive procedures to treat severe arteriopathy. For instance, patients in subgroup III underwent Percutaneous Transluminal Coronary Angioplasty, a procedure to open up blocked coronary arteries.

Our results confirm, in part, what was observed by Li et al. [li2015identification], who used topological analysis on an ad hoc cohort of T2D patients and identified three distinct subgroups characterized by 1) microvascular diabetic complications (i.e., diabetic nephropathy, diabetic retinopathy); 2) cancer of the bronchus and lungs; and 3) cardiovascular diseases and mental illness. In particular, we detected the same microvascular and cardiovascular disease groups, which are consequences of T2D. In contrast, we were unable to detect a subgroup significantly characterized by cancer, an epiphenomenon that can be caused by secondary immunodeficiency in patients with T2D [malaguarnera2010, delamaire1997]. See the Supplementary Material for a further description and a clustering comparison via the Fowlkes-Mallows index.

Parkinson’s disease

Individuals diagnosed with PD divided into two groups (Figure 4b and Supplementary Table LABEL:tab:pd): one dominated by motor symptoms and another characterized by non-motor/independent features and a longer course of disease.

Subgroup I is characterized as a tremor-dominant cohort (i.e., manifested by motor symptoms) because of the significant presence of diagnoses such as Essential tremor, Anxiety state, and Dystonia. It is interesting to note that motor clinical features likely lead to a common misdiagnosis of essential tremor, which is an action tremor that typically involves the hands. Parkinsonian tremor, on the contrary, although it can be present during postural maneuvers and action, is much more severe at rest and decreases with purposeful activities. However, when the tremor is severe, it is difficult to distinguish action tremor from resting tremor, leading to the aforementioned misdiagnosis [jain2006]. Moreover, anxiety states, emotional excitement, and stressful situations can exacerbate the tremor and lead to a delayed PD diagnosis. Brain MRI, usually non-diagnostic in PD, was ordered for several patients in this subgroup, suggesting its use for differential diagnosis, e.g., to investigate the presence of chronic/vascular encephalopathy.

Subgroup II included non-motor and independent symptoms, such as Constipation and Fatigue. Patients in subgroup II were significantly more often diagnosed with Coronary artery disease, which is prevalent in older patients. Constipation and fatigue are among the most common non-motor problems related to autonomic dysfunction, diminished activity level, and slowed intestinal transit time in PD [alves2004, siciliano2018].

In their study of PD stratification with PPMI (Parkinson’s Progression Markers Initiative) data, Zhang et al. [zhang2019data] identified three distinct subgroups of patients based on the severity of both motor and non-motor symptoms. In particular, one subgroup included patients with moderate functional decay in motor ability and stable cognitive ability; a second subgroup presented mild functional decay in both motor and non-motor symptoms; and the third subgroup was characterized by rapid progression of both motor and non-motor symptoms. EHRs do not quantitatively capture PD symptom severity; therefore, our analyses cannot replicate these findings. However, unlike Zhang et al., we can discriminate between specific motor and non-motor symptoms and also suggest a longer, but not necessarily more severe, disease course for the non-motor symptom subgroup.

Alzheimer’s disease

Patients with AD separated into three subgroups marked by AD onset, disease progression, and severity of cognitive impairment (see Figure 4c and Supplementary Table LABEL:tab:ad).

Subgroup I is characterized by patients with early-onset AD, i.e., patients whose dementia symptoms developed earlier in life than is typical, and initial neurocognitive disorder. Early-onset AD affects a minority of the individuals with AD in the US [ad] and, because clinicians do not usually look for AD in younger patients, the diagnostic process includes extensive evaluations of patient symptoms. In particular, given that a definitive AD diagnosis can only be provided post-mortem through brain examination, clinicians first rule out other causes that can lead to early-onset dementia (i.e., differential diagnosis). We find evidence of this practice in this subgroup, which includes postmenopausal women, identifiable by Osteoporosis diagnoses with calcium supplement therapy and menopausal hormone treatment (i.e., Estradiol). Patients in this group were also tested for infectious diseases (e.g., HIV, Syphilis, Hepatitis C, Chlamydia/Gonorrhoea), which are possible causes of early-onset dementia [manji2013], and screened via structural neuroimaging, e.g., MRI/PET brain. As cognitive dysfunctions that may be mistaken for dementia can also be caused by depression and other psychiatric conditions, the presence of Psychiatric service/procedure suggests psychiatric evaluations to exclude depressive pseudodementia. After the differential diagnosis process and the exclusion of other possible causes, these patients eventually received a diagnosis of AD.

Subgroup II includes patients with late-onset AD, mild neuropsychiatric symptoms, and cerebrovascular disease. Here, the absence of behavioral disturbances in a large proportion of patients, and their high average age, suggest a late AD onset, with a progression characterized by a slower rate of cognitive decline [lyketsos2002]. Moreover, the presence of Acetylsalicylic acid, an antiplatelet medication, and Intracranial hemorrhage diagnoses indicates the co-occurrence of cerebrovascular disease, which affects blood vessels and blood supply to the brain. Cerebrovascular diseases are common in aging and can often be associated with AD [snyder2015]. In this regard, Head CT may have been performed to identify structural abnormalities related to cerebrovascular disease.

Subgroup III is characterized by individuals with typical onset and mild-to-moderate dementia symptoms. A subset of these patients was treated with Donepezil, a cholinesterase inhibitor that is a primary treatment for cognitive symptoms; it is usually administered to patients with mild-to-moderate AD and produces small improvements in cognition, neuropsychiatric symptoms, and activities of daily living [birks2018]. Patients in this subgroup also showed dementia both with and without behavioral disturbances.


This study proposes a computational framework to disentangle the heterogeneity of complex disorders in large-scale EHRs through the identification of data-driven clinical patterns with machine learning. Specifically, we developed and validated an unsupervised deep learning architecture (i.e., ConvAE) to infer informative vector-based representations of millions of patients from a large and diverse hospital setting, which facilitate the identification of disease subgroups that can be leveraged to personalize medicine. These representations aim to be domain-free (i.e., not tied to any specific task, since they are learned over a large multi-domain dataset) and enable patient stratification at scale. Results from our experiments show that ConvAE significantly outperformed several baselines on clustering patients with different complex conditions, and led to the identification of clinically meaningful disease subtypes.

Results identified disease progression, symptom severity, and comorbidities as contributing the most to the EHR-based clinical phenotypic variability of complex disorders. In particular, T2D patients divided into three subgroups according to comorbidities (i.e., cardiovascular and microvascular problems) and symptom severity (i.e., newly diagnosed patients with milder symptoms). Individuals with PD showed different disease duration and symptoms (i.e., motor, non-motor). AD profiles distinguished early- and late-onset groups and separated patients with mild neuropsychiatric symptoms and cerebrovascular disease from patients with mild-to-moderate dementia. Patients with MM were characterized by different comorbidities (e.g., amyloidosis, pulmonary diseases) that manifest alongside typical signs of MM. Patients with PC and BC separated according to disease progression. These findings show that the features learned by ConvAE describe patients in a way that is general and conducive to identifying meaningful insights across different clinical domains. In particular, this work aims to contribute to the next generation of clinical systems that can (1) scale to include many millions of patient records and (2) use a single, distributed patient representation to effectively support clinicians in their daily activities, rather than multiple systems working with different patient representations derived for different tasks [miotto2016deep].

To this aim, enabling efficient data-driven patient stratification analyses to identify disease subgroups is an important step toward unlocking personalized healthcare. Ideally, when new patients enter the medical system, their health status progression can be tied to a specific subgroup, thereby informing the treating clinician of personalized prognosis and possible effective treatment strategies. This can be helpful in cases where a diagnosis is difficult and a more thorough examination is required, which sometimes might not come to the mind of a busy clinician (e.g., specific genetic or lab tests). Moreover, the clinical characteristics of the different subtypes can potentially lead to intuitions for novel discoveries, such as comorbidities, side effects, or repositioned drugs, which can be further investigated by analyzing the patients' clinical trajectories. To the best of our knowledge, this is the first attempt to derive a computational framework that enables scalable and effective patient stratification with the goal of identifying disease subtypes from EHRs. Previous studies mostly focused on a specific disease using ad hoc cohorts of patient data [baytas2017patient, doshivelez2013, li2015identification, zhang2019data, lombardo2016, stevens2019]. While these studies obtained clinically meaningful results, their computational frameworks are hard to replicate for different diseases and are tied to the specific study and data. Deep learning has recently been used to model EHRs for clinical prediction (e.g., [miotto17review, xiao2018review, choi15, deepcare2016, rajkomar_scalable_2018]) and disease phenotyping (e.g., [miotto2016deep, beaulieu2016semi]) but has not been used to enable disease subtyping at scale.

There are several limitations to our study. First, we acknowledge that the lack of any discernible pattern can also be due to noise in the data. In particular, processing EHRs with minimal data engineering, on the one hand, preserves all the available information and, to some extent, prevents systematic biases. On the other, it adds hospital-specific biases intrinsic to the EHR structure and noise due to data being redundant and too generic. Improving EHR pre-processing by, e.g., better modeling clinical notes and/or improving feature filtering, should help reduce noise and improve performance. Second, we identified patients with complex disorders using SNOMED codes, and this likely led to the inclusion of many false positives that affected the learning algorithms [wei2015]. The use of phenotyping algorithms based on manual rules, e.g., PheKB [kirby2016], or semi-automated approaches (e.g., [halpern2016, glicksberg2017]), should help identify better cohorts of patients and, consequently, better disease subtypes. Lastly, we identified relevant concepts in the patient subgroups by simply evaluating their frequency. Adding a semantic modeling component based on, e.g., topic modeling [blei2003] or word embeddings [mikolov2013efficient], might lead to more clinically meaningful patterns.

Future work will attempt to address these limitations and to further improve and replicate the architecture. First, we plan to enable multi-level clustering in order to stratify patients within the subtypes. This should lead to more granular patient stratification and, thus, to patterns at a more individual level. Second, we plan to assess ConvAE's generalizability by replicating the study on EHRs from different healthcare institutions. We will also evaluate to what extent patient stratification and phenotyping hold promise in terms of personalized treatment recommendations. To this aim, we plan to first assess treatment safety and efficacy between subtypes of a specific disease. Finally, to develop more comprehensive disease characterizations, we will include other modalities of data, e.g., genetics, into this framework, which will hopefully refine clustering and reveal new etiologies. Multi-modal stratified disease cohorts promise to facilitate better predictive capabilities for future outcomes by modeling how molecular mechanisms interact with clinical states.


The framework to derive patient representations that enable stratification analysis at scale is based on 3 steps: 1) data pre-processing; 2) unsupervised representation learning (i.e., ConvAE); and 3) clustering analysis of disease-specific cohorts (see Figure 1a). In this section, we report details of this framework as well as the description of the evaluation design.


We used EHRs from the Mount Sinai Health System, a large and diverse urban hospital located in New York, NY, which generates a high volume of structured, semi-structured and unstructured data from inpatient, outpatient, and emergency room visits. Patients in the system can have up to years of follow-up data unless they are transferred or move their residence away from the hospital system. We accessed a de-identified dataset containing approximately million patients, spanning the years from to , that was made available for use under IRB approval following HIPAA guidelines.

For each patient, we aggregated general demographic details (i.e., age, sex, and race) and clinical descriptors. We included ICD-9 diagnosis codes, medications normalized to RxNorm, CPT-4 procedure codes, vital signs, and lab tests normalized to LOINC. ICD-10 codes were mapped back to the corresponding ICD-9 versions. We pre-processed clinical notes using a tool based on the Open Biomedical Annotator that extracts clinical concepts from the free text [jonquet2009, lependu2012]. The vocabulary comprised 57,464 clinical concepts.

We retained all patients with at least two concepts, resulting in a collection of 1,608,741 different patients, with an average of records per patient. In particular, the cohort included females, males, and patients with undeclared sex; the mean age of the population was years. Patients were randomly partitioned in half for 2-fold cross-validation. In each train set, we retained random patients for validating the model hyperparameters. Details of the pre-processed train and test sets are reported in Supplementary Table.


Data pre-processing

Every patient in the dataset is represented as a longitudinal sequence of length $n$ of aggregated, temporally ordered medical concepts, i.e., $P = (w_1, w_2, \dots, w_n)$, where each $w_i$ is a medical concept from the vocabulary $V$. Pre-processing includes: 1) filtering the least and most frequent concepts; 2) dropping redundant concepts within fixed time frames; 3) splitting long sequences of records to include the complete patient history while leveraging the CNN framework, which requires fixed-size inputs.

Concept filtering

We consider the set of all EHRs as a document $D$ and each patient sequence as a sentence. For each concept $w$ in $V$, we first compute the probability of finding $w$ in $D$. We then multiply this by the sum of the probabilities of finding $w$ in a sentence, over all sentences. In particular, let $P$ be the set of all patients, $p \in P$; the filtering score is defined as:

$$f(w) = \frac{|\{p \in P : w \in s_p\}|}{N} \times \sum_{p \in P} \frac{c_p(w)}{n_p},$$

where $N$ is the total number of sentences, $n_p$ is the length of patient sequence $s_p$, and $c_p(w)$ is the number of occurrences of $w$ in $s_p$. The filtering score combines document frequency, i.e., the number of patients with at least one occurrence of $w$, and term frequency, i.e., the total number of occurrences of $w$ in a patient sequence. We then drop all concepts with filtering scores outside certain cut-off values to reduce the amount of noise (i.e., concepts that occur multiple times in few patients, are too general and not informative, or occur in many patients).
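The filtering score described above (document frequency multiplied by summed within-sequence term frequency) can be sketched in a few lines of Python. The patient sequences and concept names below are toy data for illustration only.

```python
# Toy sketch of the concept-filtering score: document frequency of a
# concept times the sum of its per-sequence term frequencies.
# Sequences and concept identifiers are made up for illustration.

def filtering_score(concept, sequences):
    """Document frequency of `concept` across patient sequences,
    multiplied by the sum of its per-sequence term frequencies."""
    n_sentences = len(sequences)
    doc_freq = sum(1 for s in sequences if concept in s) / n_sentences
    term_freq = sum(s.count(concept) / len(s) for s in sequences)
    return doc_freq * term_freq

patients = [
    ["icd9:250.00", "loinc:glucose", "rx:metformin", "loinc:glucose"],
    ["icd9:332.0", "rx:levodopa", "cpt:office_visit"],
    ["icd9:250.00", "rx:metformin", "cpt:office_visit"],
]

# A concept appearing often across many patients scores high; a concept
# concentrated in a single patient scores low.
print(filtering_score("rx:metformin", patients))  # 2/3 * (1/4 + 1/3)
print(filtering_score("rx:levodopa", patients))   # 1/3 * 1/3
```

Concepts whose scores fall outside chosen cut-offs would then be dropped from the vocabulary.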

Duplicate analysis

A patient may have multiple encounters in their health records that span consecutive days and might include repeated concepts that are often artifacts of the EHR system rather than new clinical entries. To reduce this bias, we drop all duplicate medical concepts from the patient records within overlapping time intervals of days. Within the same time window, we also randomly shuffle the medical concepts, given that events within the same encounter are generally recorded in random order [choi2016learning, glicksberg2017]. Lastly, we eliminate all patients with fewer than concepts in their records.

Record subsequencing

Patient sequences are then chopped into subsequences of fixed length $\ell$ that are used to train the ConvAE model. Each patient sequence is thus defined as a set of subsequences $s = (s^{(1)}, \dots, s^{(k)})$, and subsequences shorter than $\ell$ are padded with a dedicated padding token up to length $\ell$. For the sake of clarity, in the following section we present the architecture as applied to a general subsequence $x$.
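The chop-and-pad step above can be sketched as follows; the subsequence length and the "PAD" token are illustrative choices, not the values used in the paper.

```python
# Minimal sketch of record subsequencing: chop each patient sequence
# into fixed-length subsequences and pad the last one with a null
# token. Length 4 and the "PAD" token are illustrative assumptions.

def subsequence(seq, length, pad="PAD"):
    chunks = [seq[i:i + length] for i in range(0, len(seq), length)]
    if chunks and len(chunks[-1]) < length:
        chunks[-1] = chunks[-1] + [pad] * (length - len(chunks[-1]))
    return chunks

history = ["c1", "c2", "c3", "c4", "c5", "c6"]
print(subsequence(history, 4))
# [['c1', 'c2', 'c3', 'c4'], ['c5', 'c6', 'PAD', 'PAD']]
```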

The ConvAE architecture

ConvAE is a representation learning model that transforms patient EHR subsequences into low-dimensional, dense vectors. The architecture consists of three stacked modules (see Figure 1b). To the best of our knowledge, this is the first study proposing the combined use of embeddings, CNNs, and autoencoders to process EHRs and to derive unsupervised vector-based patient representations that can be used for clinical inference and medical analysis.

Given $x$, the architecture first assigns each medical concept to a $d$-dimensional embedding vector to capture the semantic relationships between medical concepts. Specifically, a patient subsequence is represented as an $\ell \times d$ matrix $X$, where $\ell$ is the subsequence length and $d$ is the embedding dimension. This structure also retains temporal information because the rows of matrix $X$ are temporally ordered according to patient visits.

The architecture then comprises CNNs, which extract local temporal patterns, and AEs, which learn the embedded representations for each patient subsequence. The CNN applies temporal filters to each embedding matrix. CNN filters applied to EHRs usually perform a one-sided convolution operation across time via filter sliding. Such a filter can be defined as a matrix $F \in \mathbb{R}^{h \times d}$, where $h$ is the variable window size and $d$ is the embedding dimension [zhu2016, suo2017personalized]. Our approach differs in that it processes embedding matrices as if they were RGB images carrying a third “depth” dimension. With this approach, we enable the model filters to learn independent weights for each encoding dimension, thus activating for the most salient features in each dimension of the embedding space. Therefore, we reshape the embedding matrix $X$ so that the $d$ embedding dimensions become channels, and we apply filters to the padded input to keep the same output dimension and learn features that may grasp sequence characteristics. In particular, for each filter we obtain:

$$O = \sum_{c=1}^{d} W^{(c)} * X^{(c)} + b,$$

where $O$ is the output matrix; $W^{(c)}$ is the $h$-dimensional weight matrix at depth $c$; $X^{(c)}$ is the $c$-th embedding dimension of the input matrix; $b$ is the bias vector; and $*$ is the convolution function. We used the Rectified Linear Unit (ReLU) as the activation function, followed by max pooling. The output is then reshaped into a concatenated vector. This configuration learns different weights for each embedding dimension to highlight relevant interdependencies of medical concepts, and tunes the representations of patient histories to identify the most relevant characteristics of their semantic space.
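The channel-wise convolution described above can be illustrated with a minimal pure-Python sketch: each embedding dimension acts as a channel with its own filter weights, the per-channel responses are summed with a bias, and ReLU is applied. Dimensions, weights, and inputs are toy values, not the paper's implementation (which used PyTorch).

```python
# Sketch of a channel-wise 1D convolution over a patient subsequence:
# one row of X per embedding dimension (channel), independent filter
# weights per channel, summed responses, bias, and ReLU. Toy values.

def conv1d_channels(X, W, b):
    """X: d x L input (one row per embedding dimension);
    W: d x h filter weights (independent per channel);
    returns the ReLU-activated valid convolution output."""
    d, L = len(X), len(X[0])
    h = len(W[0])
    out = []
    for t in range(L - h + 1):            # slide the window over time
        s = b
        for c in range(d):                # sum contributions per channel
            s += sum(W[c][k] * X[c][t + k] for k in range(h))
        out.append(max(0.0, s))           # ReLU activation
    return out

X = [[1.0, 2.0, 1.0, 3.0],   # channel 0 (embedding dimension 0)
     [0.5, 0.5, 0.5, 0.5]]   # channel 1 (embedding dimension 1)
W = [[1.0, -1.0],            # filter weights for channel 0
     [2.0,  1.0]]            # filter weights for channel 1
print(conv1d_channels(X, W, b=0.0))
# [0.5, 2.5, 0.0]
```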

We then use fully connected autoencoder layers to derive embedded patient representations that estimate the given input subsequences. Specifically, we extract the hidden representation $z$, a $k$-dimensional vector, as the encoded representation of each patient subsequence. Each patient sequence is then transformed into a sequence of encodings that can be post-modeled to obtain a unique vector-based patient representation. Here, we simply average all the subsequence representations component-wise.

To train ConvAE, we set up a multi-class classification task that reconstructs each initial one-hot input subsequence of medical terms from its encoded representation. Given a subsequence of medical concepts $x = (x_1, \dots, x_\ell)$, ConvAE is trained by minimizing the cross-entropy (CE) loss:

$$\mathcal{L}_{CE} = -\sum_{t=1}^{\ell} \log \hat{y}_{t, x_t},$$

where $\hat{Y}$ is the output of ConvAE reshaped into a matrix of dimension $\ell \times |V|$, $x_t$ is the $t$-th element of sequence $x$ that corresponds to a term indexed in $V$, and:

$$\hat{y}_{t, v} = \frac{\exp(\hat{Y}_{t, v})}{\sum_{u \in V} \exp(\hat{Y}_{t, u})}.$$
Since the objective function consists of only self-reconstruction errors, the model can be trained without any supervised training samples.
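The self-reconstruction objective can be illustrated with a toy softmax cross-entropy computation: for each position in the subsequence, the network outputs one score per vocabulary term, a softmax turns scores into probabilities, and the loss is the negative log-probability of the true term. The scores below are made-up values, not model outputs.

```python
# Toy sketch of the reconstruction loss: average negative
# log-likelihood of the true term at each sequence position.
import math

def cross_entropy(scores, targets):
    """scores: L x |V| output matrix (one row per position);
    targets: index of the true vocabulary term per position.
    Returns the average negative log-likelihood."""
    loss = 0.0
    for row, t in zip(scores, targets):
        z = sum(math.exp(v) for v in row)          # softmax normalizer
        loss += -math.log(math.exp(row[t]) / z)    # -log p(true term)
    return loss / len(targets)

scores = [[2.0, 0.0, 0.0],   # position 0 strongly predicts term 0
          [0.0, 0.0, 2.0]]   # position 1 strongly predicts term 2
print(cross_entropy(scores, targets=[0, 2]))
```

Minimizing this quantity requires no labels, which is what makes the training fully unsupervised.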

Clustering analysis for patient stratification

ConvAE-based representations can be used to stratify patients from any preselected cohort without needing additional feature engineering or manual adjustments. To this aim, we select a cohort of patients with a specific disease using, e.g., ICD codes, SNOMED diagnoses, or phenotyping algorithms (e.g., [wei2015, halpern2016, glicksberg2017]), and run clustering to identify subgroups in the population. Here, we use hierarchical clustering with Ward’s method and Euclidean distance. We identify the number of subclusters that best disentangles heterogeneity in the disease dataset using the Elbow Method, which empirically selects the smallest number of clusters beyond which additional clusters yield only a marginal gain in explained variance.
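The elbow heuristic for choosing the cluster count can be sketched as below. The inertia values (within-cluster variance at each candidate number of clusters) and the 10% relative-improvement threshold are illustrative assumptions, not values from this study.

```python
# Sketch of the Elbow Method: given within-cluster variance (inertia)
# for increasing numbers of clusters, pick the smallest k after which
# further splits yield only marginal relative improvement.
# Inertia values and the 0.10 threshold are illustrative.

def elbow(inertias, ks, threshold=0.10):
    """Return the smallest k whose next split reduces inertia by
    less than `threshold` (relative improvement)."""
    for k, (cur, nxt) in zip(ks, zip(inertias, inertias[1:])):
        if (cur - nxt) / cur < threshold:
            return k
    return ks[-1]

inertias = [100.0, 55.0, 30.0, 28.0, 27.5]   # for k = 2..6
print(elbow(inertias, ks=[2, 3, 4, 5, 6]))
# 4 (going from 4 to 5 clusters improves inertia by under 10%)
```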

A systematic analysis of the patients in each subgroup can then automatically identify the medical concepts that significantly and uniquely define each disease subtype. In this work, we rank all the codes by their frequency in the patient sequences. In particular, we compute the percentage of patients whose sequence includes a specific concept both with respect to a subcluster (i.e., in-group frequency) and to the complete disease cohort (i.e., total frequency). The ranking maximizes the in-group percentage first and the total percentage second. We then analyze the most frequent concepts and use a pairwise chi-squared test to determine whether the distributions of present/absent concepts across the detected subgroups are significantly different [zhang2019data].
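The concept-ranking step can be sketched as follows; the cohort, concept names, and subcluster membership are toy data for illustration.

```python
# Sketch of subtype characterization: for each concept, compute the
# share of patients carrying it within a subcluster (in-group
# frequency) and within the whole disease cohort (total frequency),
# then sort by in-group first and total second. Toy data.

def rank_concepts(cohort, subcluster_ids):
    sub = [cohort[i] for i in subcluster_ids]
    concepts = {c for seq in cohort for c in seq}
    def freq(group, c):
        return sum(1 for seq in group if c in seq) / len(group)
    scored = [(c, freq(sub, c), freq(cohort, c)) for c in concepts]
    return sorted(scored, key=lambda x: (-x[1], -x[2]))

cohort = [
    {"neuropathy", "metformin"},     # patient 0 (in subcluster)
    {"retinopathy", "metformin"},    # patient 1 (in subcluster)
    {"metformin"},                   # patient 2
    {"insulin"},                     # patient 3
]
for concept, in_group, total in rank_concepts(cohort, [0, 1]):
    print(concept, in_group, total)
```

A chi-squared test on the present/absent counts per subgroup would then assess whether the top-ranked concepts differ significantly across subtypes.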

Evaluation design

This section describes the pipeline implemented to evaluate ConvAE for patient stratification at scale.

Implementation details

All model hyperparameters were empirically tuned to minimize the network reconstruction error, while balancing training efficiency and computation time. We tested a large number of configurations (e.g., time interval equal to ; patient subsequence length equal to ; embedding dimension spanning ). For brevity, we report only the final setting used in the patient stratification experiments. All modules were implemented in Python , using scikit-learn and pytorch as machine learning libraries [scikit-learn, paszke2017]. Computations were run on a server with an Nvidia Titan V GPU. Code is available at:

We used equation (1) to discard terms with a filtering score less than , i.e., document frequency ranging from to . Examples of discarded concepts are clotrimazole, an antifungal medication, and torsemide, a medication to reduce extra fluid in the body. We decided to retain all the very frequent concepts, as most of them seemed clinically informative (e.g., vital signs). Patients with fewer than medical concepts were then discarded. In total, medical terms were filtered out, decreasing the vocabulary size to .

We divided each patient history into consecutive, half-overlapping temporal windows of days, shuffled unique medical concepts, and dropped redundant terms. Patient sequences were then split into fixed-length subsequences, obtaining about subsequences of medical concepts for training. This value was chosen to enable efficient training of the autoencoder on GPUs.

We initialized medical concept embeddings using word2vec with the skip-gram model [mikolov2013efficient]. We considered all the subsequences in the training set as sentences and medical concepts as words [glicksberg2017, choi2016learning]. We obtained -dimensional embeddings for medical concepts of the vocabulary. The remaining concepts were initialized randomly; the subsequence padding was initialized as the null vector (i.e., at ). These embedding vectors were then used as input for the ConvAE module and were further refined during the model training.

The CNN module used 50 filters with kernel size equal to 5 and ReLU activation. The autoencoder was composed of 4 hidden layers with 200, 100, 200, and $|V|$ hidden nodes, respectively, where $|V|$ is the vocabulary size. We used activation in the first three layers and activation in the final layer to obtain continuous output. We applied dropout with in the first two layers for regularization. The model was trained using the cross-entropy loss with the Adam optimizer (learning rate = and weight decay = ) [kingma2014] for 5 epochs on all training data, with a batch size of . The size of the patient representations was equal to .

We evaluated different CNN configurations composed of one layer (i.e., “ConvAE 1-layer CNN”), two layers (i.e., “ConvAE 2-layer CNN”), and one multikernel layer (i.e., “ConvAE multikernel CNN”). All hyperparameters were the same, except the number of filters in the second CNN layer of the 2-layer configuration, which was set to . The multikernel CNN performs parallel training of distinct CNNs with different kernel sizes and concatenates the final outputs. We used kernel dimensions equal to , , and .


We compared ConvAE with the following representation learning algorithms: “RawCount”, “SVD-RawCount”, “SVD-TFIDF”, and “Deep Patient”. All baselines derived vector-based patient encodings of size .

RawCount is a sparse representation where each patient is encoded into a count vector that has the length of the vocabulary. More specifically, each individual health history is represented as an integer vector $r \in \mathbb{N}^{|V|}$, where each element $r_v$ is the frequency of the corresponding clinical concept $v$ in the patient's longitudinal history $s$, i.e., $r_v = \sum_{t=1}^{n} \mathbb{1}[w_t = v]$.
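The RawCount encoding amounts to a per-patient concept histogram over a fixed vocabulary; the vocabulary and history below are toy values.

```python
# Sketch of the RawCount baseline: each patient becomes an integer
# vector of concept counts over a fixed vocabulary. Toy data.
from collections import Counter

vocabulary = ["icd9:250.00", "rx:metformin", "loinc:glucose"]

def raw_count(history, vocab=vocabulary):
    counts = Counter(history)
    return [counts[c] for c in vocab]

# One row of the RawCount matrix:
print(raw_count(["rx:metformin", "loinc:glucose", "rx:metformin"]))
# [0, 2, 1]
```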

SVD-RawCount applies truncated singular value decomposition (SVD) to the RawCount matrix to compute the largest singular values of the raw count encodings, which define the dense, lower-dimensional representations.

SVD-TFIDF transforms the raw count encodings using the term frequency–inverse document frequency (TFIDF) weighting schema and applies truncated SVD to the resulting matrix. We considered the patient EHR sequences as documents and the entire dataset as the corpus, and derived TFIDF scores for all medical concepts. Each patient is then represented as a vector of length $|V|$ holding the corresponding TFIDF weight for each concept, and the resulting matrix is reduced via truncated SVD.
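The TFIDF weighting can be sketched as below. The sketch uses one common variant (raw term frequency times the log of corpus size over document frequency); the exact weighting schema and smoothing are not specified here, so treat this as illustrative, with toy data.

```python
# Sketch of TFIDF weighting over patient "documents": term frequency
# within a patient sequence times log inverse document frequency.
# This is one common TFIDF variant, used here for illustration.
import math

def tfidf_matrix(patients, vocab):
    n = len(patients)
    df = {c: sum(1 for p in patients if c in p) for c in vocab}
    rows = []
    for p in patients:
        rows.append([
            (p.count(c) / len(p)) * math.log(n / df[c]) if df[c] else 0.0
            for c in vocab
        ])
    return rows

patients = [
    ["rx:metformin", "icd9:250.00"],
    ["rx:metformin", "rx:metformin", "loinc:glucose"],
]
vocab = ["rx:metformin", "icd9:250.00", "loinc:glucose"]
for row in tfidf_matrix(patients, vocab):
    print(row)
```

Concepts shared by every patient get zero weight, while rarer concepts are up-weighted; the resulting matrix would then be reduced with truncated SVD.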

Deep Patient transforms the raw count matrix using a stack of denoising autoencoders, as proposed by Miotto et al. [miotto2016deep]. We used the implementation details presented in the paper, with batch size equal to , corruption noise equal to , and 5 training epochs.

Multi-Disease clustering analysis

We evaluated all the representation learning approaches in a clustering task to determine how well they disentangle patients with different conditions. We chose eight complex disorders: type 2 diabetes (T2D), multiple myeloma (MM), Parkinson’s disease (PD), Alzheimer’s disease (AD), Crohn’s disease (CD), prostate cancer (PC), breast cancer (BC) and attention deficit hyperactivity disorder (ADHD). We retrieved all the corresponding patients in the test sets using SNOMED–CT (Systematized Nomenclature of Medicine – Clinical Terms) codes after verifying that at least one corresponding ICD-9 code was present in the patient’s EHRs. In particular, we looked for Type 2 diabetes mellitus (250.00) for T2D; Multiple myeloma without mention of having achieved remission (203.00) for MM; Paralysis agitans (332.0) for PD; Alzheimer’s disease (331.0) for AD; Regional enteritis of unspecified site (555.9) for CD; Malignant neoplasm of prostate (185) for PC; Malignant neoplasm of female breast (174.9) for BC; and Attention deficit disorder with hyperactivity (314.01) for ADHD. We discarded all patients with comorbidities within the selected diseases to facilitate the clustering interpretation. We then performed hierarchical clustering with as many clusters as diseases for all the representations, to evaluate whether patients with the same condition grouped together. The final test sets were composed of about patients per fold but were unbalanced, with disease cohorts ranging from about to patients (see Supplementary Table). To use balanced datasets and improve the efficacy of the experiment, we sub-sampled random patients for the highly populated diseases and iterated this subsampling process times, obtaining different clusterings per test set.

We used entropy and purity, averaged across the experiments, to measure the extent to which the clusters matched the different diseases. In particular, for each cluster $i$, we define the probability that a patient in $i$ has disease $j$ as:

$$p_{ij} = \frac{m_{ij}}{m_i},$$

where $m_i$ is the number of patients in cluster $i$ and $m_{ij}$ is the number of patients in cluster $i$ with a diagnosis of disease $j$. The entropy of each cluster is defined as:

$$H_i = -\sum_{j} p_{ij} \log_2 p_{ij},$$

and the conditional entropy is then computed as:

$$H = \sum_{i} \frac{m_i}{m} H_i,$$

where $m$ is the total number of elements in the complex disease dataset.

Purity identifies the most represented disease in each cluster. For a cluster $i$, purity is defined as $U_i = \max_j p_{ij}$, where $p_{ij}$ is computed as before. The overall purity score is then the weighted average $U = \sum_i \frac{m_i}{m} U_i$. A perfect clustering obtains an average entropy of 0 and a purity of 1.
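Both scores can be computed from a cluster-by-disease contingency table, as in the sketch below; the counts are toy values.

```python
# Sketch of the entropy and purity evaluation scores, computed from
# a cluster-by-disease contingency table. Counts are toy values.
import math

def entropy_purity(table):
    """table[i][j] = number of patients with disease j in cluster i.
    Returns (conditional entropy, overall purity)."""
    m = sum(sum(row) for row in table)
    H, U = 0.0, 0.0
    for row in table:
        mi = sum(row)
        probs = [mij / mi for mij in row if mij > 0]
        Hi = -sum(p * math.log2(p) for p in probs)   # cluster entropy
        H += (mi / m) * Hi                           # weighted average
        U += (mi / m) * max(probs)                   # weighted purity
    return H, U

# Two clusters, two diseases; cluster 0 is pure, cluster 1 is mixed.
H, U = entropy_purity([[10, 0], [5, 5]])
print(H, U)
# 0.5 0.75
```

A perfect assignment (each cluster containing a single disease) yields entropy 0 and purity 1.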

Disease subtyping analysis

We evaluated the usability of ConvAE representations to discover disease subtypes for different and diverse conditions (i.e., patient stratification at scale). In particular, we selected a cohort of patients with T2D, PD, AD, MM, PC, and BC and ran hierarchical clustering on the ConvAE-based patient representations. These are all age-related complex disorders with late onset (i.e., increased prevalence after years of age [cowie2018, lau2006, qiu2009, kazandjian2016, pc1, bc2]). We focused only on these conditions in an attempt to reduce confounding age effects that could affect the analysis of the subtypes (as could happen with the CD and ADHD cohorts, where a common onset age is less well defined). To reduce noise in the sequence encodings, we averaged all patient subsequence representations from the first diagnosis onward, and we dropped sequences shorter than concepts. We varied the number of clusters from to and used the Elbow Method to empirically select the smallest number of clusters beyond which additional clusters yield only a marginal gain in explained variance. We then performed a qualitative analysis of each subtype, similarly to Zhang et al. [zhang2019data], to identify which medical concepts characterized each specific group of patients. We further verified the various subgroups in the medical literature and with the support of a practicing clinician.


R.M. would like to thank the support from the Hasso Plattner Foundation, the Alzheimer’s Drug Discovery Foundation and a courtesy GPU donation from Nvidia. I.L. acknowledges the support from the Bruno Kessler Foundation.

Competing interests

The authors declare no competing interests.


I.L. and R.M. conceived and designed the work. I.L. conducted the research and the experimental evaluation, and wrote the manuscript. R.M. created the dataset, supervised the research and substantially edited the manuscript. B.S.G. substantially edited the manuscript and created the architecture figures. H.L. advised on methodological choices. S.C., H.L., and M.D. refined the manuscript. G.L. provided clinical validation of the results. J.T.D. and C.F. supported the research. All the authors reviewed the final manuscript.


Entropy Purity N diseases
ConvAE 1-layer CNN () () ()
ConvAE 2-layer CNN () () ()
ConvAE multikernel CNN () () ()
RawCount () () ()
SVD-RawCount () () ()
SVD-TFIDF () () ()
DeepPatient () () ()
Mean (sd, CI);

Mean (standard deviation);

CNN = Convolutional Neural Network; DP = Deep Patient; SVD = Singular Value Decomposition;
TFIDF = Term Frequency - Inverse Document Frequency
Table 1: Clustering performances of ConvAE configurations and baselines. The scores reported are averaged over a 2-fold cross validation experiment. ConvAE 1-layer CNN significantly outperforms all other configurations and baselines on all measures. Multiple pairwise t-tests with Bonferroni correction are used to compare performances.
Figure 1: Framework enabling patient stratification analysis from deep unsupervised EHR representations (a); Details of the ConvAE representation learning architecture (b).

AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease;
MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes

Figure 2: Uniform Manifold Approximation and Projection (UMAP) encoding visualization. ConvAE 1-layer CNN (a); SVD-RawCount (b); SVD-TFIDF (c); DeepPatient (d).

AD = Alzheimer’s disease; ADHD = Attention deficit hyperactivity disorder; BC = Breast cancer; CD = Crohn’s disease;
MM = Multiple myeloma; PC = Prostate cancer; PD = Parkinson’s disease; T2D = Type 2 diabetes

Figure 3: Uniform Manifold Approximation and Projection (UMAP) clustering visualization. ConvAE 1-layer CNN (a); SVD-RawCount (b); SVD-TFIDF (c); DeepPatient (d).
Figure 4: Complex disorder subgroups. A subsample of patients with T2D is displayed in Figure (a).