1 Introduction
The rapid adoption of electronic health records (EHRs) has enabled the use of the EHR data for primary and secondary purposes, such as clinical process optimization, clinical decision support, treatment outcome improvement, clinical research, and epidemiological monitoring of the nation’s health hersh2007adding . An emerging use on the EHR data is to develop advanced machine learning models, primarily supervised learning, to discover new interconnections between diseases, facilitate precise prediction of health status, and help effectively prevent diseases or disabilities obermeyer2016predicting ; miotto2016deep ; xiao2018opportunities ; wang2018clinical . As opposed to supervised learning, unsupervised machine learning has been introduced to identify new patterns and relations in the irregularlysampled data without using human created labels lecun2015deep ; chen2018neural , mostly for predictive modeling, such as the prediction of patient health status miotto2016deep , disease progression trajectory prediction wang2014unsupervised , or phenotypes prediction pivovarov2015learning ; son2018deep . In this paper, we investigate the use of unsupervised machine learning in the discovery of latent disease clusters and patient subgroups using EHRs.
The problem of discovering potential disease clusters and patient subgroups is extremely important for the study of aging. According to the United Nations Population Division, the global share of older people (
60 years old) increased from 8% in 1950 and 9% in 1990 to 12 % in 2013, and will continue to grow to an estimated 21% in 2050
united2013world . The development of chronic illness plays an important role in this demographic shift. The older people who survive with chronic illnesses are more likely to develop additional chronic illnesses divo2014ageing . Traditionally, most comorbidities have been studied separately vanfleteren2013clusters , however, it is common for most older people to have two or more chronic morbidities. Therefore, discovering disease clusters will help in the systematic examination of all comorbidities for associations with a specific condition and improve risk assessment and future prediction. Moreover, identifying new disease clusters that may reflect underlying mechanisms (“latent traits”) would help define new domains of risk in a population. Discovering patient subgroups with the similar underlying disease patterns could facilitate diagnosis and treatment decision making, and epidemiological analysis and research schnell2016bayesian .In this study, we utilized Latent Dirichlet Allocation (LDA), an unsupervised generative probabilistic models, and proposed a novel model named Poisson Dirichlet Model (PDM), which extends the LDA approach for the EHR data. The proposed PDM uses a Poisson distribution to model patients’ disease diagnoses by considering both observed and expected observations that alleviates the impact of age and sex on a population. In the experiments, we evaluated LDA and PDM on the EHR data of three patient cohorts, namely Osteoporosis, Delirium/Dementia, and Chronic Obstructive Pulmonary Disease (COPD)/Bronchiectasis Cohorts, retrieved from the Rochester Epidemiology Project (REP) medical records linkage system, for the discovery of latent disease clusters and patient subgroups. We compared the effectiveness of LDA and PDM in identifying latent disease clusters through the visualization of disease representations learned by two approaches. We also tested the performance of LDA and PDM in differentiating patient subgroups through survival analysis, as well as statistical analysis of demographics and Elixhauser Comorbidity Index (ECI) scores in those subgroups. The experimental results show that the proposed PDM could effectively identify distinguished disease clusters based on the latent patterns hidden in the EHR data by alleviating the impact of age and sex, and that LDA could stratify patients into more differentiable subgroups than PDM in terms of pvalues. However, the subgroups discovered by PDM might imply the underlying patterns of diseases of greater interest in epidemiology research due to the alleviation of age and sex. Therefore, both unsupervised machine learning approaches could be leveraged to discover patient subgroups using EHRs but with different foci.
2 Background
Many integrative studies have assessed the relation and impact of multimorbidities, which could reveal disease clusters, potential pathobiology mechanisms, and improve our understanding of conditions in older people divo2014ageing . For example, Barabási et al. used the mathematical network theory to demonstrate disease coexistence in a graph model barabasi2011network . However, the disease network relies on the biological networks and interaction resources, and is limited due to the incompletion of our genome and phoneme knowledge ni2018constructing . The wide adoption of EHR data has provided an unprecedented opportunity to apply datadriven analysis to uncover new links in diseases and patients gligorijevic2016large . However, the complete capture of all diagnostic events in a geographicallydefined population is a key factor for discovering disease patterns.
The creation of medical record linkage systems that connect EHR data from multiple institutions could capture the entire health care experience of a geographically defined population. The REP is a pioneer linkage system developed through a collaboration between health care providers in southeastern Minnesota, and involves Olmsted Medical Center, Mayo Clinic, Rochester Family Medicine Clinic and other medical care providers in southeastern Minnesota melton1996history ; rocca2012history ; st2012data . It is a unique infrastructure for epidemiology and outcomes research that links the medical records of local health care providers to community residents. Enabled by the REP, previously studies evaluated morbidity occurrence one diagnosis at a time melton2013long and used traditional analytic techniques (e.g., tree models/recursive partitioning) to define disease clusters savica2013risk . However, these approaches do not adequately address the cooccurrence of multiple disease states within an individual.
A set of methods relevant to studying multimorbidities has arisen in the field of document processing under the rubric of “topic models”. Latent Dirichlet Allocation (LDA) is a typical topic modeling method proposed by Blei et al. blei2012probabilistic
. LDA categorizes all words in a collection of textual documents into a set of distinct “topics”, while simultaneously classifying each document by the topics it contains
zhao2014topic . A given word may be associated with multiple topics, and multiple topics may appear in a given document. Since a patient could also be represented by a set of diagnosed diseases that share similar undiscovered interrelations, we simply used the analogy “words”“diseases”, “documents”“patients”, “topics”“latent disease clusters”, and applied LDA to discover latent disease clusters in our previous work li2014discovering . Although our previous study showed the potential of LDA in leveraging hidden pattern information from EHR data, we failed to observe dichotomized latent disease clusters from the results of LDA. Moreover, another shortcoming of LDA, especially in a cohort that spans a large age range, is that it will identify clusters that are due primarily to age and/or sex, disease associations that are already known and thus not very interesting. Examples would be athletic injuries or vaccinations in the young and joint ailments in the old. LDA uses a Multinomial distribution to simulate the generative process of each single disease, which leads LDA to focus on the proportions of various diseases in the cohort. Of greater interest in epidemiology research is the prediction or clustering of excess risk, event rates that are over and above what would be expected for a given age and sex. Furthermore, we didn’t investigate the potential of topic models in discovering patient subgroups for a defined cohort. Stratifying patients into subgroups with similar characteristics and risks will not only facilitate epidemiological analysis and research, but enable personalized care that will improve the efficiency and effectiveness of disease prevention, diagnosis, and treatment.3 Methods
3.1 Mathematical Modeling
Suppose denotes a disease diagnosis code set with size , a cohort is represented as a group of patients , and a patient is represented as a sequence of disease diagnoses where is a disease diagnosis code from for patient . Given these notations, we describe LDA and the proposed PDM in this subsection.
Let denote the latent disease clusters, which is akin to topics in LDA. Suppose denotes the dimensionality of , a
dimensional Dirichlet random variable,
a dimensional parameter with , and a matrix parameter, we can define LDA as the following generative process for each patient in a cohort :
For each of latent disease clusters:

Choose Dirichlet


For each of patients in :

Choose Dirichlet().

For each of diseases that patient has been diagnosed:

Choose a latent disease cluster Multinomial().

Choose a disease Multinomial().


LDA can be represented as a graphical model at the upperleft of Figure 1. By applying Gibbs sampling, we can learn parameters and , and hyperparameters and in LDA. We could leverage
, the probability of a disease in a latent topic, to discover latent comorbidities that appear in the same disease cluster.
In this study, we also propose a novel unsupervised machine learning model, Poisson Dirichlet Model (PDM), which extends LDA for discovering disease clusters of excess risk. Based on a patient’s age, sex, and length of followups we can compute an expected number of diagnoses for subject m and diagnosis n, and compare the observed count to this expectation. We hypothesize that the PDM could be sensitive to patterns of excess disease risk, which will be different than overall risk, and that these patterns could identify more distinguished disease clusters than LDA.
With the same notations used for LDA, PDM assumes the following generative process for each patient in the cohort:

For each of latent disease clusters:

Choose Dirichlet


For each of patients in :

Choose Gamma(, ), where .

Choose Dirichlet().

For each of diseases that patient has been diagnosed:

Choose a latent disease cluster Multinomial().

Choose a disease count Poisson().


The proposed PDM model is represented as a probabilistic graphical model at the upperright of Figure 1. In this generative process, denotes the expected number of diagnoses for disease
that is computed by a simple rate model fit to the overall data. Specifically, the followups for each subject in the cohort were divided into bins based on single years of age and sex. For each disease diagnosis, the counts (number of occurrences) were modeled using a generalized additive model separately for males and females, assuming a Poisson error structure. Age was fit using a smoothing spline with 4 degrees of freedom and the log of personyears in each bin were treated as an offset. The expected event rates for each person were estimated using predictions from these models based on the age, sex, and followup of each person. After dividing each patient’s followup by age and sex,
can be derived mathematically by:where is a coefficient, is an offset of the personyears, is age, and follows a Poisson distribution.
Poisson distribution is utilized to generate the total number of each disease that has been diagnosed for a patient. is a positive multiplier for patient
generated from a Gamma distribution with mean equal to
. Since some patients may have a significantly larger number of diagnosis (e.g., sicker or better insurance) than others, is utilized as a normalizer on the number of patients’ diagnosis so that the parameters of Poisson distribution are not learned towards the extreme cases.Figure 1 also illustrates that LDA models the observed number of disease diagnoses while PDM takes advantage of the combination of observed and expected number of diagnoses. The use of expected data for each patient in PDM alleviates the impact of age and sex, and lessen the difficulty of LDA in handling missing data that were not collected during residents’ absence in the healthcare system, particularly in medical EHRs, while simultaneously incorporates the epidemiological characteristics of the population. Since the experimental data are from Olmsted County, we used the expected risk table of the Minnesota population. Since is precalculated before training PDM, it is treated as a fixedvalued constant in the parameter learning process.
3.2 Parameter Estimation
Markov Chain Monte Carlo (MCMC) methods are usually used to generate random samples that can be used in estimating the parameters of posterior distributions in a probabilistic machine learning model. MCMC constructs a Markov chain to converge to the target distribution, and generates samples from that Markov chain. Since the Dirichlet priors are conjugate to the Multinomial distributions, Gibbs sampling, a widely adopted MCMC algorithm, is utilized for inference of the LDA model griffiths2004finding . However, it cannot be applied to PDM since the Poisson distributions are not conjugate to the Dirichlet priors. Therefore, a more general MCMC sampling method, MetropolisHastings (MH) algorithm, was applied to approximate the distributions and learn the parameters of PDM. The MH sampling algorithm creates a Markov chain based on a proposal distribution and corrects the wrong density through an acceptancerejection step, in comparison with the Gibbs sampling algorithm that always accepts the proposal distribution. We briefly introduce the MH algorithm as below.
Suppose that denotes the true distributions we would like to inference, denotes the proposal distribution, is a new sample drawn from , and is the acceptance probability for defined by
the MH sampling algorithm is illustrated in Algorithm 1. We implemented the parameter estimation algorithm using JAGS^{1}^{1}1http://mcmcjags.sourceforge.net/ and rJAGS^{2}^{2}2https://cran.rproject.org/package=rjags, which automatically choose the proposal distribution for the sampling process. We utilized two sampling chains with a burnin of 500 iterations followed by 1000 iterations for inference.
3.3 Applications on EHRs
3.3.1 Discovering Latent Disease Clusters
Given a cohort of patients diagnosed with a certain disease, unsupervised machine learning approaches allow us to discover latent comorbidity clusters for that disease, which could help define new domains of risk. LDA and the proposed PDM model represent diseases in a latent topic space. The estimated parameter in LDA or PDM is a diseasetopic matrix indicating the probability of a disease occurring in a latent topic. We hypothesize that the diseases with similar characteristics would be automatically clustered in the latent topic space, which is called a latent disease cluster. In order to verify the effectiveness of LDA and PDM for identifying latent disease clusters, we qualitatively visualize the disease representation in the latent topic space using the diseasetopic matrix by applying a machine learning visualization method, tSNE maaten2008visualizing , which maps the disease representation in the highdimensional topic space into a twodimensional space that enables visualization. By doing so, we intuitively evaluate the potential of unsupervised machine learning in discovering disease clusters. Since the perplexity and the number of iterations are two important parameters for the tSNE, we tested different combinations and chose the ones generating the best visualization results for PDM and LDA.
3.3.2 Discovering Patient Subgroups
An interesting question in epidemiology study is whether we can stratify patients into subgroups so that the patients in the same subgroup have similar health characteristics and risks. Populationbased evidence has been shown to be a major source of support for medical decision making for an individual ledbetter2001toward
. Using the estimated parameters of LDA or PDM, we can calculate the posterior probability of a latent disease cluster for a given patient, i.e.,
, which is a patienttopic probability matrix. In order to discover patient subgroups, we could leverage clustering analysis on the patient feature vectors by using the rows of patientdisease cluster matrix. In our experiments, we tested three clustering algorithms, namely hierarchical clustering
ward1963hierarchical, Kmeans clustering
hartigan1979algorithm , and Birch clustering algorithms zhang1996birch , with five different numbers of subgroups, ranging from 2 to 6 subgroups.To evaluate these subgroups, we carried out survival analysis on each patient subgroup. We used the Logrank statistical test to compare the difference between the survival curves of patient subgroups. In addition to the survival analysis, we conducted statistical analysis on the demographics and number of diagnoses for patient subgroups. We also used the widely adopted comorbidity measure, Elixhauser Comorbidity Index (ECI) elixhauser1998comorbidity , to compare patient subgroups in each cohort. We compared ECI scores between patient subgroups, and 29 ECI categories, including congestive heart failure, cardiac arrhythmias, valvular disease, pulmonary circulation disorders, peripheral vascular disease, hypertension, paralysis, other neurological disorders, chronic pulmonary disease, diabetes (uncomplicated), diabetes (complicated), hypothyroidism, renal failure, liver disease, peptic ulcer disease (excluding bleeding), lymphoma, metastatic cancer, solid tumour (without metastasis), rheumatoid arthritis/collaged vascular disease, coagulopathy, obesity, weight loss, fluid and electrolyte disorders, blood loss anaemia, deficiency anaemia, alcohol abuse, drug abuse, psychoses, and depression. Our goal is to evaluate whether the patient subgroups discovered by the proposed PDM model could differentiate patients in a defined cohort.
4 Datasets
In this section, we describe the datasets used in the empirical experiments of testing LDA and the proposed PDM approach. Three cohorts extracted from the REP, namely the Osteoporosis Cohort, the Delirium/Dementia Cohort, and the Chronic Obstructive Pulmonary Disease (COPD)/Bronchiectasis Cohort, were utilized to evaluate the proposed PDM model. These cohorts were retrieved from the REP cohort that consisted of patients 50 years of age and older during the interval January 1, 1995 through December 31, 2011 from Olmsted County whose total number of disease diagnoses was between three hundred and five hundred and who had at least thirty diagnoses of osteoporosis (CCS code: 206), delirium/dementia (CCS code: 653), and COPD/bronchiectasis (CCS code: 127), respectively. These diseases have been shown to be related to aging with complicated comorbidities. We used Clinical Classifications Software (CCS)^{3}^{3}3https://www.hcupus.ahrq.gov/toolssoftware/ccs/ccs.jsp as the diagnosis taxonomy for each patient since the International Classification of Diseases (ICD) classification has too detailed granularity for clinical practice li2014discovering . Since the REP had created a longitudinal record spanning each subject’s entire period of residency in the community, we retrieved all their ICD9 diagnostic codes from the REP linkage system. We collapsed the ICD9 codes into 285 CCS categories and removed the less frequent codes. Table 1 lists the basic demographics of the three cohorts, including the number of patients (male and female), median age (male and female), and median observed number of diagnoses for the three cohorts. LDA and PDM were trained on the data of these cohorts, with five different numbers of latent disease clusters (). Eventually was chosen for both approaches based on outperforming experimental results.
Cohort  Osteoporosis Cohort  Delirium/Dementia Cohort  COPD/Bronchiectasis Cohort 

# Patients  388  304  685 
# Male (%)  21 (5.4%)  95 (31.2%)  337 (49.2%) 
# Female (%)  367 (94.6%)  209 (68.8%)  348 (50.8%) 
Median Age  74.4  83.6  73.2 
Median Age (Male)  74.7  85.0  75.1 
Median Age (Female)  68.8  81.6  71.1 
Median Observed # Diagnosis  406  387.5  402 
5 Results
5.1 Visualization of Latent Disease Clusters
We first compared the visualization of disease representations in the latent topic space learned by LDA and the proposed PDM model for three cohorts, as depicted in Figure 2. The perplexity parameters of tSNE for PDM and LDA are 10 and 20, respectively, and the number of iterations is 5000 for both approaches. It is obviously shown that the disease clusters are explicitly dichotomized by PDM while almost no disease clusters could be observed by LDA. Since LDA failed to identify latent disease clusters, we list the latent comorbidities that are clustered with osteoporosis, delirium/dementia, and COPD by PDM in Table 2. In order to verify the results, we tried to find evidence by searching PubMed^{4}^{4}4https://www.ncbi.nlm.nih.gov/pubmed/ articles’ title or abstract using keywords from the target disease and the latent comorbidities, in combination with the term “comorbidity” or “comorbidities”. For example, we found 31 PubMed research articles for osteoporosis and implant or graft, 52 articles for dementia and osteoporosis, and 30 articles for COPD and cerebrovascular disease. The latent comorbidities are not highly related to age and sex and thus the prediction or clustering of excess risk would be of more interest for epidemiological analysis. This result shows that the proposed PDM model is able to learn the latent patterns hidden in the EHR data that differentiate disease clusters by alleviating the impact of age and sex on the diseases.
LDA  PDM  
Osteoporosis Cohort  
Delirium/Dementia Cohort  
COPD/Bronchiectasis Cohort 
Disease  Latent Comorbidities 

Osteoporosis (CCS: 206)  Headache; including migraine (CCS: 84)
Nonspecific chest pain (CCS: 102) Diverticulosis and diverticulitis (CCS: 146) Complication of device; implant or graft (CCS: 237) Abdominal pain (CCS: 251) 
Delirium/dementia (CCS: 653)  Immunizations and screening for infectious disease (CCS: 10)
Conduction disorders (CCS: 105) Osteoporosis (CCS: 206) Fracture of lower limb (CCS: 230) 
COPD/bronchiectasis (CCS: 127)  Peri; endo; and myocarditis; cardiomyopathy (except that caused by tuberculosis or sexually transmitted disease) (CCS:97)
Aortic; peripheral; and visceral artery aneurysms (CCS: 115) Fracture of lower limb (CCS: 230) Late effects of cerebrovascular disease (CCS: 113) Complications of surgical procedures or medical care (CCS: 238) Biliary tract disease (CCS: 149) Other female genital disorders (CCS: 175) 
5.2 Validation of Patient Subgroups
In this section, we demonstrate the experimental results of patient subgroups discovered by LDA and the proposed PDM model for three cohorts. As aforementioned, we tested three clustering algorithms (i.e., hierarchical clustering, Kmeans clustering, and Birch clustering algorithms) and five different numbers of patient subgroups (i.e., 2, 3, 4, 5, 6) on the patienttopic matrix computed by LDA and PDM. We carried out survival analysis on these subgroups for each cohort. The pvalues of the Logrank test on the survival curves are listed in Table 3. From the table, we observe that LDA generallly produces patient subgroups with smaller pvalues than PDM. When LDA was utilized, the Kmeans clustering algorithm generates differentiated patient subgroups with statistical significance for both Osteoporosis and COPD/Bronchiectasis Cohorts when the number of patient subgroups is 2, and achieved the best pvalues () for the Delirium/Dementia Cohort when the number of patient subgroups is 3. The pvalues of patient subgroups generated by PDM were much higher than those by LDA: Kmeans clustering algorithm achieved pvalues of 0.071 and 0.00028 with the number of patient subgroups equal to 6 and 2 for the Delirium/Dementia and COPD Cohorts, respectively; and the Birch clustering algorithm produced a pvalue of 0.0085 for the Osteoporosis Cohort when the number of patient subgroups is 2. Overall, the Kmeans clustering algorithm outperforms other clustering algorithms in identifying patient subgroups. The patient subgroup results with the best pvalues from LDA and PDM with the smallest number of patient subgroups are chosen for further survival and comorbidity analysis.
Number of subgroups  2  3  4  5  6  

Unsupervised Methods  PDM  LDA  PDM  LDA  PDM  LDA  PDM  LDA  PDM  LDA 
Osteoporosis Cohort  
Hierarchical clustering  0.16  0.00017  0.28  0.00081  0.44  0.00075  0.41  0.00073  0.41  0.0001 
Kmeans clustering  0.39  0.0001  0.58  0.0001  0.38  0.0001  0.38  0.0001  0.6  0.00032 
Birch clustering  0.0085  0.002  0.015  0.0038  0.0098  0.00078  0.011  0.0021  0.018  0.0035 
Delirium/Dementia Cohort  
Hierarchical clustering  0.10  0.41  0.083  0.23  0.16  0.40  0.26  0.54  0.28  0.61 
Kmeans clustering  0.34  0.012  0.49  0.0051  0.50  0.011  0.42  0.029  0.071  0.0057 
Birch clustering  0.14  0.071  0.34  0.19  0.32  0.34  0.46  0.50  0.12  0.033 
COPD/Bronchiectasis Cohort  
Hierarchical clustering  0.017  0.0045  0.014  0.0001  0.035  0.0001  0.027  0.0001  0.021  0.0001 
Kmeans clustering  0.00028  0.0001  0.00032  0.0001  0.0017  0.0001  0.086  0.0001  0.0026  0.0001 
Birch clustering  0.15  0.00045  0.10  0.0001  0.15  0.0001  0.12  0.0001  0.2  0.0001 
5.2.1 Survival curves
KaplanMeier survival curves of the selected patient subgroups discovered by LDA and PDM are depicted in Figure 3 for three cohorts.
Osteoporosis Cohort  Delirium/Dementia Cohort  COPD/Bronchiectasis Cohort 
LDA(a)  LDA(b)  LDA(c) 
PDM(a)  PDM(b)  PDM(c) 
Survival analysis of patient subgroups discovered by LDA for the Osteoporosis Cohort depicted in Figure 3 LDA(a) showed significant difference between the survival curves of patient subgroups with p. The patients in Subgroup 1 have a obviously distinguished worse survival rate than those in Subgroup 2. Thus, at the point of clinical care, the patients in Subgroup 1 should receive more attention than those in Subgroup2. Figure 3 LDA(b) shows significant difference at level between the survival curves of patient subgroups for the Delirium/Dementia Cohort. Three identified patient subgroups are distinguishable with disparate survival curves. The patients in Subgroups 2 have a better survival rate than those in Subgroups 1 and 3. The patients in Subgroups 1 have the lowest survival rate across the survival time distribution. Figure 3 LDA(c) shows a significant difference between the survival curves of patient subgroups with p for the COPD/Bronchiectasis Cohort. The patients in Subgroup 2 have a prominent lower risk and better survival than those in Subgroup 1.
Figures 3 PDM(a), PDM(b), and PDM(c) show the survival analysis on the patient subgroups discovered by PDM for three cohorts. Figure 3 PDM(a) showed difference between the survival curves of patient subgroups at the level of 0.01 for the Osteoporosis Cohort. The survival curves are similar to those by LDA. The patients in Subgroup 2 have a better survival rate than those in Subgroup 1 when survival time is approximately 6 years. The survival probability of both subgroups drops dramatically when survival time is between 6 years and 8 years. The survival probability of patients in Subgroup 2 decreases more rapid than those in Subgroup 1 after survival time 11 years. Though Figure 3 PDM(b) does not show significant difference between the survival curves of patient subgroups for the Dementia Cohort, patient subgroups are distinguishable with disparate survival curves. For example, the patients in Subgroups 1, 3, and 6 have similar survival curves when survival time is approximately 5 years. However, the patients in Subgroups 6 have longer survival time than those in Subgroups 1 and 3. The patients in Subgroups 2 and 5 have better survival than other subgroups across the survival time distribution, and those in Subgroup 2 have better survival than those in Subgroup 5 when survival time is 7 years. Figure 3 PDM(c) shows that the survival curves of patient subgroups are different at the level of ¡0.001 for the COPD Cohort. The patients in Subgroup 1 have a noticeable better survival rate than those in Subgroup 2. These results explicitly show the effectiveness of unsupervised machine learning models in stratifying patients into groups with different risks.
5.2.2 Statistical analysis
Methods  LDA  PDM  
Subgroups  Subgroup 1  Subgroup 2  pvalue  Subgroup 1  Subgroup 2  pvalue  
# Patients (%)  271 (69.8%)  117 (30.2%)  216 (55.7%)  172 (44.3%)  
Sex  0.034  0.096  
# Male (%)  252 (93.0%)  115 (98.3%)  8 (3.7%)  13 (7.6%)  
# Female (%)  19 (7.0%)  2 (1.7%)  208 (96.3%)  159 (92.4%)  
Median Age  78.3  66.3  0.001  71.1  74.6  0.002  
Median # Diagnosis  414.0  390.0  0.007  405  408  0.815  
Median ECI  6.0  4.0  0.001  6.0  5.0  0.013  
ECI Groups  0.001  0.331  
# Patients in ECI (01)  7 (2.6%)  9 (7.7%)  9 (4.2%)  7 (4.1%)  
# Patients in ECI (24)  65 (24.0%)  64 (54.7%)  65 (30.1%)  64 (37.2%)  
# Patients in ECI (5+)  199 (73.4%)  44 (37.6%)  142 (65.7%)  101 (58.7%)  
ECI Categories with p  
Congestive heart failure  48 (17.7%)  5 (4.3%)  0.001  Psychoses  65 (30.1%)  23 (13.4%)  0.001 
Pulmonary circulation disorders  28 (10.3%)  1 (0.9%)  0.001  
Hypertension  209 (77.1%)  64 (54.7%)  0.001  
Other neurological disorders  85 (31.4%)  16 (13.7%)  0.001  
Weight loss  97 (35.8%)  16 (13.7%)  0.001  
Fluid and electrolyte disorders  139 (51.3%)  30 (25.6%)  0.001  
Psychoses  85 (31.4%)  3 (2.6%)  0.001 
Demographics and statistics of patient subgroups identified by LDA and PDM from the Osteoporosis Cohort. Statistically significance is based on the KruskalWallis Test (p
0.001). The complete analysis for all ECI categories is provided in the supplemental material.Methods  LDA  PDM  

Subgroups  Subgroup 1  Subgroup 2  Subgroup 3  pvalue  Subgroup 1  Subgroup 2  Subgroup 3  Subgroup 4  Subgroup 5  Subgroup 6  pvalue  
# Patients (%)  167 (55.0%)  72 (23.7%)  65 (21.3%)  47 (15.5%)  41 (13.5%)  34 (11.2%)  62 (20.4%)  56 (18.4%)  64 (21.1%)  
Sex  0.114  0.445  
# Male (%)  44 (26.3%)  28 (38.9%)  23 (35.4%)  16 (34.0%)  15 (36.6%)  9 (26.5%)  19 (30.6%)  12 (21.4%)  24 (37.5%)  
# Female (%)  123 (73.7%)  44 (61.1%)  42 (64.6%)  31 (66.0%)  26 (63.4%)  25 (73.5%)  43 (69.4%)  44 (78.6%)  40 (62.5%)  
Median Age  86.4  79.7  82.9  0.001  82.4  84.0  84.9  84.3  85.7  82.3  0.623  
Median # Diagnosis  385.0  414.0  376.0  0.009  383.0  399.0  385.5  409.0  411.5  368.5  0.007  
Median ECI  8.0  8.0  7.0  0.001  8.0  8.0  7.0  8.0  8.0  8.0  0.130  
ECI Groups  0.001  0.874  
# Patients in ECI (01)  0 (0.0%)  0 (0.0%)  0 (0.0%)  0 (0.0%)  0 (0.0%)  0 (0.0%)  0 (0.0%)  0 (0.0%)  0 (0.0%)  
# Patients in ECI (24)  5 (3.0%)  2 (2.8%)  12 (18.5%)  2 (4.3%)  2 (4.9%)  2 (5.9%)  3 (4.8%)  4 (7.1%)  6 (9.4%)  
# Patients in ECI (5+)  162 (97.0%)  70 (97.2%)  53 (81.5%)  45 (95.7%)  39 (95.1%)  32 (94.1%)  59 (95.2%)  52 (92.9%)  58 (90.6%)  
ECI Categories with p  
Hypertension  140 (83.8%)  58 (80.6%)  39 (60.0%)  0.001 
Methods  LDA  PDM  
Subgroup 1  Subgroup 2  pvalue  Subgroup 1  Subgroup 2  pvalue  
# Patients (%)  495 (72.3%)  190 (27.7%)  207 (30.2%)  478 (69.8%)  
Sex  0.005  0.489  
# Male (%)  260 (52.5%)  77 (40.5%)  106 (51.2%)  231 (48.3%)  
# Female (%)  235 (47.5%)  113 (59.5%))  101 (48.8%)  247 (51.7%)  
Median Age  76.0  66.8  0.001  71.1  74.6  0.002  
Median # Diagnosis  403.0  402.0  0.791  409.0  399.5  0.526  
Median ECI  9.0  6.0  0.001  7.0  8.0  0.001  
ECI Groups  0.001  0.016  
# Patients in ECI (01)  0 (0.0%)  2 (1.1%)  2 (1.0%)  0 (0.0%)  
# Patients in ECI (24)  15 (3.0%)  35 (18.4%))  21 (10.1%)  29 (6.1%)  
# Patients in ECI (5+)  480 (97.0%)  153 (80.5%)  184 (88.9%)  449 (93.9%)  
ECI Categories with p  
Congestive heart failure  264 (53.3%)  27 (14.2%)  0.001  Fluid and electrolyte disorders  111 (53.6%)  337 (70.5%))  0.001 
Cardiac arrhythmias  350 (70.7%)  102 (53.7%)  0.001  
Pulmonary circulation disorders  154 (31.1%)  23 (12.1%)  0.001  
Renal failure  123 (24.8%)  19 (10.0%)  0.001  
Obesity  75 (15.2%)  50 (26.3%)  0.001  
Weight loss  191 (38.6%)  37 (19.5%)  0.001  
Fluid and electrolyte disorders  380 (76.8%)  68 (35.8%)  0.001  
Deficiency anaemia  110 (22.2%)  21 (11.1%)  0.001  
Psychoses  140 (28.3%)  19 (10.0%)  0.001 
Tables 4, 5, and 6 list the statistics of demographics, number of diagnoses, and ECI scores for the patient subgroups identified by LDA and PDM for the Osteoporosis, Delirium/Dementia, and COPD/Bronchiectasis Cohorts, respectively. Statistically significance is based on the KruskalWallis Test (p 0.001). Reported also includes the ECI categories that are different among patient subgroups with statistical significance (pvalue) in each cohort. The complete analysis for each ECI category can be found in the supplemental material.
We first analyze the patient subgroups identified by LDA. As shown in Table 4 for the Osteoporosis Cohort, the difference of sex in two patient subgroups is not significantly different. The patients in Subgroup 1 are older than those in Subgroup 2 with statistically significance at the level of 0.001, which might be the reason that the patients of Subgroup 1 have worse survival rate in Figure 3 (a). Median ECI scores between two subgroups are statistically different at the level of 0.001. The patients in Subgroup 1 have a higher median ECI score, which means they have more comorbidities. 73.4% of patients in Subgroup 1 also have a larger number of comorbidites in terms of ECI (5+). The result is consistent with their survival analysis in Figure 3 (a). Seven ECI categories, including congestive heart failure, pulmonary circulation disorders, hypertension, other neurological disorders, weight loss, fluid and electrolyte disorders, and psychoses, are statistically different between two patient subgroups at the level of 0.001. The diseases in these ECI categories potentially contributed to differentiate patients into subgroups for the Osteoporosis Cohort.
For the patient subgroups identified by LDA from the Delirium/Dementia Cohort, there is statistical significant difference in age but not in sex among three subgroups. The fact that Subgroup 2 has better survival than Subgroup 1 is mainly due to the younger age of Subgroup 2. This result is consistent with previous findings that age is a strong risk factor for dementia gao1998relationships . No statistical significance is found in the number of diagnoses at the level of 0.001. The patients in Subgroup 1 and those in Subgroup 2 have a similar number of comorbidities in terms of median number of ECI and ECI (5+). Hypertension is the ECI category that has statistically significant difference between three subgroups, which is also consistent with the outcome of large observational studies that that hypertension plays a role in dementia and Alzheimer’s disease tzourio2007hypertension .
There is also statistical significant different in age but not in sex for the patient subgroups identified by LDA from the COPD/Bronchiectasis Cohort. The patients in Subgroup 1 have a higher median ECI score and a much larger number of comorbidities in terms of ECI (5+) than those in Subgroup 2 with a statistically significant difference at p. This result is consistent with the survival analysis in Figure 3 (c). Nine ECI categories are statistically different between two patient subgroups at the level of 0.001. These categories include congestive heart failure, cardiac arrhythmias, pulmonary circulation disorders, renal failure, obesity, weight loss, fluid and electrolyte disorders, deficiency anaemia, and psychoses. Interestingly, obesity appears to be associated with better outcome (15.2% in Subgroup 1 and 26.3% in Subgroup 2) and weight loss with worse outcome (38.6% in Subgroup 1 and 19.5% in Subgroup 2). That is likely related to the fact that being underweight in COPD/bronchiectasis is often a bad prognostic factor since the respiratory muscles lose strength during severe weight loss, which leads to respiratory failure. Being extremely obese also worsens COPD/bronchiectasis, but is not addressed in our analysis. As expected, psychosis are the comorbidities that worse COPD/bronchiectasis associated with the outcomes.
The first observation from the results of patient subgroups identified by PDM is that, unlike LDA, PDM does not differentiate patients into subgroups based on age and sex. No statistical significance are found in age and sex for the patient subgroups of three cohorts. This result validates the ability of PDM in removing the factors of age and sex for discovering patient subgroups, which enables analysis of hidden patterns of diseases that are of greater interest in epidemiology research. For the Osteoporosis Cohort, no statistical significance is found in the number of diagnoses, as well as the number of comorbidities. Only one ECI category, i.e., psychoses, is statistically different between two patient subgroups at the level of 0.001. For the Delirium/Dementia Cohort, no ECI category was found different with statistical significance among patient subgroups. This analysis could not find reasons why Subgroup 5 has the best survival according to the survival curves in Figure 3 PDM(b). This result implies that the conventional statistical analysis could not identify significant factors that cause differential patient subgroups discovered by PDM. In other words, PDM might leverage other undiscovered hidden disease patterns to discover these patient subgroups. For the COPD/Bronchiectasis Cohort, the patients in Subgroup 2 are older than those in Subgroup 1 but without statistical significance at the level of 0.001. ECI scores between two subgroups are statistically different with p. This result, similar to that of LDA, indicates that age portends worse prognosis as does greater comorbidities (ECI). Only one ECI categories, i.e., fluid and electrolyte disorders, are statistically different between two patient subgroups at the level of 0.001.
6 Discussion and Conclusion
In this study, we investigate the applications of unsupervised machine learning approaches in discovering latent disease clusters and patient subgroups using the EHR data. We utilized LDA, an unsupervised probabilistic generative model in the rubric of topic models, and proposed a novel unsupervised machine learning approach, named PDM. PDM extends the conventional LDA and uses a Poisson distribution to model patients’ disease diagnoses and to alleviate age and sex factors by considering both observed and expected observations. We applied LDA and PDM to two clinical use cases: discovery of latent disease clusters and patient subgroups using EHRs.
Both approaches were evaluated on the diagnostic EHR data of three cohorts, namely the Osteoporosis Cohort, the Delirium/Dementia Cohort, and the COPD/Bronchiectasis Cohort, retrieved from the REP medical linkage system. We verified the effectiveness of discovering latent disease clusters through the visualization of disease representation in the latent topic space. The 2D scattered plot showed that PDM discovered explicitly dichotomized disease clusters based on the latent patterns hidden in the EHR data than LDA. This result implies that we could utilize these disease clusters to identify multiple latent comorbities, which could be used to calculate excess risk above what would be expected for a given age and sex.
Furthermore, we applied LDA and PDM to discover patient subgroups, and carried out survival analysis on these subgroups. The experimental results show that LDA could stratify patients into more differentiable subgroups than PDM in terms of pvalues. However, those subgroups identified by LDA are highly associated with patients’ age and sex. Though the difference between the subgroups discovered by PDM has worse pvalues, these patient subgroups might imply the underlying patterns of diseases of greater interest in epidemiology research by alleviating the impact of age and sex. Therefore, the proposed PDM might be a better option than LDA for studying latent disease patterns in aging cohorts for which we would like to alleviate the impact of age and sex since they are major drivers of agingassociated diseases.
Due to the similarity of unsupervised machine learning and human learning, unsupervised learning is more closely aligned with artificial intelligence (AI), where a computer is expected to learn to identify complex processes and patterns without a human’s guidance. Compared to the supervised machine learning models that lack of generalizability and suffer from infeasibility of discovering novel patterns from EHRs
miotto2016deep , the unsupervised machine learning techniques utilized to discover particular comorbidity clusters from EHRs that may reflect underlying mechanisms (“latent traits”) would help define new domains of risk. The disease clusters discovered by PDM might contain new potential risks for diseases. Moreover, unsupervised machine learning approaches could be used to stratify patients into subgroups with similar characteristics and risks. Discovering differentiated patient subgroups will not only facilitate epidemiological analysis and research, but enable personalized care that will improve the efficiency and effectiveness of disease prevention, diagnosis, and treatment. We believe that both approaches could be leveraged to create an AI platform for exploiting the rich data resources of the REP, and likewise be served as a model for use by others with EHRs.The limitations of this study are fourfold. First, LDA and the proposed PDM model were tested on cohorts with a relative small number of patients and diagnoses due to the computational cost of MH algorithm. Faster MCMC methods should be considered in future work so that the proposed model could be scaled up to larger cohorts. Second, the time stamp of diagnosis, which is an important feature for epidemiology, was not considered in LDA and PDM models. Third, PDM was only evaluated on the EHR data from the REP. In the future, we will evaluate the generalizability of the proposed model on the EHR data from other institutions. Fourth, unsupervised machine learning approaches other than probabilistic generative models, such as deep neural network based models
miotto2016deep ; choi2016doctor ; choi2017gram , were not considered in this study. In the future, we will compare the proposed approaches with the deep neural network based models in discovering latent disease clusters and patient subgroups using EHRs.Acknowledgment
This work was supported by NIH grants P01AG004875, R01GM102282, U01TR002062, and R01LM011934, and made possible by the Rochester Epidemiology Project (R01AG034676) and the U.S. Public Health Service.
References
 (1) W. R. Hersh, “Adding value to the electronic health record through secondary use of data for quality assurance, research, and surveillance,” American Journal of Managed Care, vol. 13, no. 6, pp. 277–279, 2007.
 (2) Z. Obermeyer and E. J. Emanuel, “Predicting the future?big data, machine learning, and clinical medicine,” The New England journal of medicine, vol. 375, no. 13, p. 1216, 2016.
 (3) R. Miotto, L. Li, B. A. Kidd, and J. T. Dudley, “Deep patient: an unsupervised representation to predict the future of patients from the electronic health records,” Scientific reports, vol. 6, p. 26094, 2016.

(4)
C. Xiao, E. Choi, and J. Sun, “Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review,”
Journal of the American Medical Informatics Association, vol. 25, no. 10, pp. 1419–1428, 2018.  (5) Y. Wang, L. Wang, M. RastegarMojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn et al., “Clinical information extraction applications: a literature review,” Journal of biomedical informatics, vol. 77, pp. 34–49, 2018.
 (6) Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, vol. 521, no. 7553, p. 436, 2015.

(7)
T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in
Advances in Neural Information Processing Systems, 2018, pp. 6572–6583.  (8) X. Wang, D. Sontag, and F. Wang, “Unsupervised learning of disease progression models,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2014, pp. 85–94.
 (9) R. Pivovarov, A. J. Perotte, E. Grave, J. Angiolillo, C. H. Wiggins, and N. Elhadad, “Learning probabilistic phenotypes from heterogeneous ehr data,” Journal of biomedical informatics, vol. 58, pp. 156–165, 2015.
 (10) J. H. Son, G. Xie, C. Yuan, L. Ena, Z. Li, A. Goldstein, L. Huang, L. Wang, F. Shen, H. Liu et al., “Deep phenotyping on electronic health records facilitates genetic diagnosis by clinical exomes,” The American Journal of Human Genetics, vol. 103, no. 1, pp. 58–73, 2018.
 (11) D. o. E. United Nations and P. D. Social Affairs, “World population ageing 2013,” New York: United Nations, 2013.
 (12) M. J. Divo, C. H. Martinez, and D. M. Mannino, “Ageing and the epidemiology of multimorbidity,” 2014.
 (13) L. E. Vanfleteren, M. A. Spruit, M. Groenen, S. Gaffron, V. P. van Empel, P. L. Bruijnzeel, E. P. Rutten, J. Op?t Roodt, E. F. Wouters, and F. M. Franssen, “Clusters of comorbidities based on validated objective measurements and systemic inflammation in patients with chronic obstructive pulmonary disease,” American journal of respiratory and critical care medicine, vol. 187, no. 7, pp. 728–735, 2013.
 (14) P. M. Schnell, Q. Tang, W. W. Offen, and B. P. Carlin, “A bayesian credible subgroups approach to identifying patient subgroups with positive treatment effects,” Biometrics, vol. 72, no. 4, pp. 1026–1036, 2016.
 (15) A.L. Barabási, N. Gulbahce, and J. Loscalzo, “Network medicine: a networkbased approach to human disease,” Nature reviews genetics, vol. 12, no. 1, p. 56, 2011.
 (16) P. Ni, J. Wang, P. Zhong, Y. Li, F. Wu, and Y. Pan, “Constructing disease similarity networks based on disease module theory,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2018.
 (17) D. Gligorijevic, J. Stojanovic, N. Djuric, V. Radosavljevic, M. Grbovic, R. J. Kulathinal, and Z. Obradovic, “Largescale discovery of diseasedisease and diseasegene associations,” Scientific reports, vol. 6, p. 32404, 2016.
 (18) L. J. Melton III, “History of the rochester epidemiology project,” Mayo Clinic Proceedings, vol. 71, no. 3, pp. 266–274, 1996.
 (19) W. A. Rocca, B. P. Yawn, J. L. S. Sauver, B. R. Grossardt, and L. J. Melton, “History of the rochester epidemiology project: half a century of medical records linkage in a us population,” Mayo Clinic proceedings, vol. 87, no. 12, pp. 1202–1213, 2012.
 (20) J. L. St Sauver, B. R. Grossardt, B. P. Yawn, L. J. Melton III, J. J. Pankratz, S. M. Brue, and W. A. Rocca, “Data resource profile: the rochester epidemiology project (rep) medical recordslinkage system,” International journal of epidemiology, vol. 41, no. 6, pp. 1614–1624, 2012.
 (21) r. Melton, LJ, S. Achenbach, E. Atkinson, T. M. Therneau, and S. Amin, “Longterm mortality following fractures at different skeletal sites: a populationbased cohort study,” Osteoporosis International, vol. 24, no. 5, pp. 1689–1696, 2013.
 (22) R. Savica, B. R. Grossardt, J. H. Bower, J. E. Ahlskog, and W. A. Rocca, “Risk factors for parkinson’s disease may differ in men and women: an exploratory study,” Hormones and behavior, vol. 63, no. 2, pp. 308–314, 2013.
 (23) D. M. Blei, “Probabilistic topic models,” Communications of the ACM, vol. 55, no. 4, pp. 77–84, 2012.
 (24) W. Zhao, W. Zou, and J. J. Chen, “Topic modeling for cluster analysis of large biological and medical datasets,” in BMC bioinformatics, vol. 15, no. 11. BioMed Central, 2014, p. S11.
 (25) D. C. Li, T. Therneau, C. Chute, and H. Liu, “Discovering associations among diagnosis groups using topic modeling,” AMIA Summits on Translational Science Proceedings, vol. 2014, p. 43, 2014.
 (26) T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proceedings of the National academy of Sciences, vol. 101, no. suppl 1, pp. 5228–5235, 2004.
 (27) L. v. d. Maaten and G. Hinton, “Visualizing data using tsne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
 (28) C. S. Ledbetter and M. W. Morgan, “Toward best practice: leveraging the electronic patient record as a clinical data warehouse,” Journal of Healthcare Information Management, vol. 15, no. 2, pp. 119–132, 2001.
 (29) J. H. Ward Jr, “Hierarchical grouping to optimize an objective function,” Journal of the American statistical association, vol. 58, no. 301, pp. 236–244, 1963.
 (30) J. A. Hartigan and M. A. Wong, “Algorithm as 136: A kmeans clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
 (31) T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: an efficient data clustering method for very large databases,” in ACM Sigmod Record, vol. 25, no. 2. ACM, 1996, pp. 103–114.
 (32) A. Elixhauser, C. Steiner, D. R. Harris, and R. M. Coffey, “Comorbidity measures for use with administrative data,” Medical care, pp. 8–27, 1998.
 (33) S. Gao, H. C. Hendrie, K. S. Hall, and S. Hui, “The relationships between age, sex, and the incidence of dementia and alzheimer disease: a metaanalysis,” Archives of general psychiatry, vol. 55, no. 9, pp. 809–815, 1998.
 (34) C. Tzourio, “Hypertension, cognitive decline, and dementia: an epidemiological perspective,” Dialogues in clinical neuroscience, vol. 9, no. 1, p. 61, 2007.

(35)
E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun, “Doctor ai: Predicting clinical events via recurrent neural networks,” in
Machine Learning for Healthcare Conference, 2016, pp. 301–318. 
(36)
E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun, “Gram: graphbased attention model for healthcare representation learning,” in
Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 787–795.