1 Introduction
Disease prediction is an effective way to assess a person’s health status. Studies have shown that in many cases identifiable indicators or preventable risk factors are present before the onset of a disease. Acting on these early warnings can reduce an individual’s risk of disease and, in theory, reduce the number of treatments needed while making the necessary interventions more effective. However, the combinations of risk factors associated with different diseases and a patient’s past medical history are so complicated that no doctor can fully understand all of them. Currently, doctors use family and health history and physical examinations to estimate a patient’s risk and to guide laboratory tests that further evaluate the patient’s health. These sporadic and qualitative “risk assessments” usually cover only a few diseases and depend on the experience, memory and time of the particular doctor. As a result, current medical care is largely reactive: it intervenes once the symptoms of a disease appear, rather than proactively treating or eliminating the disease as early as possible.
Today the prevailing model of prospective health care is firmly based on the genome revolution. Indeed, technologies ranging from linkage equilibrium and candidate gene association studies to genome-wide associations have provided an extensive list of disease-gene associations, offering us detailed information on mutations, SNPs, and the associated likelihood of developing specific disease phenotypes.
The basic assumption behind this research is that once we have classified all disease-related mutations, we can use various molecular biomarkers to predict each individual’s susceptibility to future diseases, thus bringing us into an era of predictive medicine. However, these rapid advances have also revealed the limitations of genome-based methods. Considering that the signals provided by most disease-related SNPs or mutations are very weak, it is becoming increasingly clear that the promise of genome-based methods may not be realized soon.
Does this mean that prospective disease prediction methods must wait until genomics methods are sufficiently mature? Our purpose is to prove that the method based on medical history provides hope for the prospective prediction of disease.
In this paper, we study disease prediction and disease comorbidity. Our approach is distinctly different in that we try to build a general predictive system that can use a less constrained feature space, i.e., one that takes into account all available demographics and previous medical history. Moreover, to account for the previous medical history we rely primarily on ICD-10-CM (International Classification of Diseases, Tenth Revision, Clinical Modification) codes (see Section 2) rather than on specialized test results.
2 Data
2.1 Source Data and Population
Our database comprises the medical records of 354,552 patients in China with a total of 2,904,257 hospital visits. The data were originally compiled from insurance claims from 2007 to 2017. Such medical records are highly complete and accurate, and they are frequently used for epidemiological and demographic research.
The input for our methods consists of each patient’s personal information, such as gender, birthday, treatment date, and diagnosis history, provided per visit. Each data record corresponds to one hospital visit and is represented by a patient ID and a diagnosis code, as defined by the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). The International Statistical Classification of Diseases and Related Health Problems (ICD) provides codes to classify diseases and a wide variety of signs, symptoms, abnormal findings, social circumstances, and external causes of injury or disease. It is published by the World Health Organization.
Each disease or health condition is given a unique code of up to 6 characters, such as A01.001. The first character is a letter while the others are digits. ICD-10 codes are hierarchical, so the 6-character codes can be collapsed to fewer characters that identify a small family of related medical conditions. For instance, A01.001 is a specific code for typhoid fever, and it can be collapsed to A01.
Moreover, we group diseases of the same category into one class. For example, A90 is the code for dengue fever (classical dengue) and A91 is the code for dengue hemorrhagic fever; we put both into the class named F_A90. In this way, the 20 thousand original ICD-10 codes are grouped into 429 classes.
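The two coding steps described above, collapsing a full code to its category and mapping categories to classes, can be sketched as follows. This is only an illustration: the grouping table `CATEGORY_TO_CLASS` is hypothetical and contains just the A90/A91 example from the text, and the fallback label `"F_" + category` is an assumption, not the actual 429-class mapping.

```python
# Hypothetical grouping table: maps a 3-character ICD-10 category to its class.
# Only the A90/A91 -> F_A90 entry comes from the text; a real table would
# cover all 429 classes.
CATEGORY_TO_CLASS = {"A90": "F_A90", "A91": "F_A90"}


def collapse_code(code: str) -> str:
    """Collapse a full code such as 'A01.001' to its 3-character category."""
    return code.strip().upper().split(".")[0][:3]


def to_class(code: str) -> str:
    """Map a full code to a class label (fallback 'F_' + category is assumed)."""
    category = collapse_code(code)
    return CATEGORY_TO_CLASS.get(category, "F_" + category)
```

For example, `collapse_code("A01.001")` gives `"A01"`, and both `to_class("A90")` and `to_class("A91")` give `"F_A90"`.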
A sample patient medical history is shown in Table 1. Each line represents one hospital visit. Demographic data are also available.
patient_id  gender  treatment_date  code 

14532  F  20111015  F_M47 
14532  F  20111119  F_N91 
14532  F  20121009  F_L20 
14532  F  20121019  F_N60 
14532  F  20130508  F_B37 
14532  F  20130604  F_H10 
14532  F  20130615  F_K04 
14532  F  20130823  F_L20 
In our medical database, the number of visits per patient ranges from 1 to 491, with a median of 4 and a mean of 8.19. Table 2 shows the 20 most prevalent diseases in our database.
Disease  Prevalence 

Acute upper respiratory infection  20.88% 
Hypertension and its complications  7.35% 
Dermatitis and pruritus  3.75% 
Gastritis and duodenitis  3.49% 
Chronic bronchitis  3.28% 
Pulp, gum, and alveolar ridge diseases  3.15% 
Hard tissue disease of teeth  2.73% 
Abnormal uterine and vaginal bleeding  2.42% 
Chronic rhinitis, nasopharyngitis and pharyngitis  1.99% 
Noninfectious gastroenteritis and colitis  1.97% 
Chronic ischemic heart disease  1.93% 
Inflammation of the vagina and vulva  1.72% 
Pneumonia  1.63% 
Abnormal thyroid (parathyroid) function  1.62% 
Other diabetes  1.56% 
Backache  1.54% 
Acute lower respiratory infection  1.51% 
Cervical disc disease  1.44% 
Type II diabetes  1.37% 
Female pelvic inflammatory disease  1.18% 
2.2 Quantifying the Strength of Comorbidity Relationships
To measure the correlation that arises from disease comorbidity, we need to quantify the strength of comorbidity by introducing a notion of distance between two diseases. One difficulty is that different statistical measures carry biases that overestimate or underestimate the relationships involving rare or highly prevalent diseases. These biases matter because the number of diagnoses (prevalence) of a disease follows a long-tailed distribution: most diseases are rarely diagnosed, while a small number of diseases are diagnosed in a large part of the population.
Therefore, quantifying comorbidity usually requires us to compare diseases that affect dozens of patients with diseases that affect millions of patients.
We use two comorbidity measures to quantify the distance between two diseases: the Absolute Logarithmic Relative Risk (ALRR) and the $\phi$-correlation.
The empirical relative risk of observing a pair of diseases $i$ and $j$ affecting the same patient is given by
$$RR_{ij} = \frac{C_{ij} N}{P_i P_j},$$
where $C_{ij}$ is the number of patients affected by both diseases, $N$ is the total number of patients in the population, and $P_i$ and $P_j$ are the prevalences of diseases $i$ and $j$.
Thus, the relative risk is defined as
$$RR_{ij} = \frac{p_{ij}}{p_j}.$$
Here, $p_{ij}$ is the transition probability from disease $i$ to disease $j$ and $p_j$ is the incidence probability of disease $j$. The Absolute Logarithmic Relative Risk is defined as
$$ALRR_{ij} = \left| \log RR_{ij} \right| = \left| \log \frac{p_{ij}}{p_j} \right|.$$
The empirical $\phi$-correlation, which is Pearson’s correlation for binary variables, can be expressed mathematically as
$$\phi_{ij} = \frac{C_{ij} N - P_i P_j}{\sqrt{P_i P_j (N - P_i)(N - P_j)}}.$$
The $\phi$-correlation used in this paper is defined analogously in terms of the probabilities $p_{ij}$ and $p_j$.
These two comorbidity measures are not completely independent of each other, as they both increase with the number of patients affected by both diseases, yet each has its intrinsic biases. For example, RR overestimates relationships involving rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas $\phi$ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases.
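The empirical versions of the two measures can be computed directly from patient counts. The sketch below uses the standard formulas $RR_{ij} = C_{ij} N / (P_i P_j)$ and the Pearson correlation of two binary indicators. The function names, the natural logarithm and the toy counts are assumptions made for illustration, and the code works on co-occurrence counts rather than on the transition probabilities used later in the paper.

```python
import math


def relative_risk(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Empirical relative risk RR_ij = C_ij * N / (P_i * P_j)."""
    return c_ij * n / (p_i * p_j)


def alrr(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Absolute logarithmic relative risk |log RR_ij| (natural log assumed)."""
    return abs(math.log(relative_risk(c_ij, p_i, p_j, n)))


def phi_correlation(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Pearson correlation of two binary disease indicators."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


# Toy counts (invented for illustration): 1000 patients, 100 with disease i,
# 200 with disease j, 40 with both.
rr = relative_risk(40, 100, 200, 1000)
phi = phi_correlation(40, 100, 200, 1000)
```

With independent diseases (here 20 shared patients instead of 40) the relative risk is 1, so both the ALRR and the $\phi$-correlation vanish.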
3 Methodology
In this section, we formulate the maximum entropy method that we use to predict disease risk.
3.1 Some notations
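Suppose there are $n$ diseases and $m$ records. Let us use $d_i$ to denote disease $i$. A record is a pair of diseases $(d_i, d_j)$, which means that there is a patient with a diagnosis of disease $d_j$ simultaneously with or after disease $d_i$. Let us use $r_k$ to denote record $k$ ($k = 1, \dots, m$).
Assume that $r_k = (d_{f(k)}, d_{s(k)})$. Here, $f$ and $s$ are maps from $\{1, \dots, m\}$ to $\{1, \dots, n\}$.
$d_{f(k)}$ is called the first disease and $d_{s(k)}$ is called the second disease in record $r_k$.
In this paper, we assume that $f$ and $s$ are surjective. If $f$ is not surjective, we can remove the diseases with indexes in $\{1, \dots, n\} \setminus f(\{1, \dots, m\})$ from the medical history. Then the surjectivity assumption is satisfied for $f$. The same can be done for $s$.
Define the $m \times n$ matrices $F$ and $S$ by
$$F_{ki} = 1 \text{ if } f(k) = i, \quad F_{ki} = 0 \text{ otherwise},$$
$$S_{ki} = 1 \text{ if } s(k) = i, \quad S_{ki} = 0 \text{ otherwise}.$$
Let $A = F^T S$ and $B = S F^T$. Then
$A_{ij}$ is the number of patients who suffer from disease $i$ before disease $j$.
$B$ is an $m \times m$ matrix with entries in $\{0, 1\}$. $B_{kl} = 1$ if and only if there is a disease $d$ such that the patient in record $r_k$ suffers the second disease $d$ and the patient in record $r_l$ suffers the first disease $d$. Our task is to evaluate the transition probability from record $r_k$ to record $r_l$.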
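A small numerical example may help to fix the two matrices described above. The sketch assumes one particular encoding that is not spelled out in the text: 0/1 indicator matrices `F` and `S` (hypothetical names) mark the first and second disease of each record, the disease-level count matrix is `F.T @ S`, and the record-level 0/1 matrix is `S @ F.T`. The toy records are invented.

```python
import numpy as np

# Toy records (first disease index, second disease index), 0-based.
# These values are invented for illustration.
records = [(0, 1), (1, 2), (2, 0), (0, 1)]
n_diseases = 3
m_records = len(records)

# F[k, i] = 1 iff disease i is the first disease of record k; S likewise
# for the second disease.
F = np.zeros((m_records, n_diseases), dtype=int)
S = np.zeros((m_records, n_diseases), dtype=int)
for k, (first, second) in enumerate(records):
    F[k, first] = 1
    S[k, second] = 1

A = F.T @ S  # A[i, j]: number of records with first disease i and second j
B = S @ F.T  # B[k, l] = 1 iff second disease of record k == first of record l
```

In this example `A[0, 1]` is 2 because two records go from disease 0 to disease 1, and the largest eigenvalue modulus of `A` and `B` is the same, which is the property used later to work with the smaller matrix.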
3.2 Entropy for Markov Chains
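Suppose $M = (M_{ij})$ is a nonnegative square matrix. If for each pair $(i, j)$ there exists $k \geq 1$ such that $(M^k)_{ij} > 0$, then $M$ is said to be irreducible.
Now we define the entropy for Markov chains. A matrix $K$ is called a skeleton matrix if its entries are either $0$ or $1$. A nonnegative matrix $P$ is called a Markov transition matrix if
$$\sum_j P_{ij} = 1 \quad \text{for all } i.$$
Moreover, if $P_{ij} = 0$ whenever $K_{ij} = 0$, then $P$ is called the Markov transition matrix associated with the skeleton matrix $K$.
For a nonnegative matrix $W$, if
$$\sum_{i,j} W_{ij} = 1$$
and for all $i$,
$$\sum_j W_{ij} = \sum_j W_{ji},$$
then $W$ is called a Markov weight matrix.
Here are some connections between Markov transition matrices and Markov weight matrices.
For a Markov transition matrix $P$ with stationary distribution $\pi$, define
$$W_{ij} = \pi_i P_{ij}.$$
Then it is easy to verify that $W$ is a Markov weight matrix.
Conversely, given a Markov weight matrix $W$, set
$$P_{ij} = \frac{W_{ij}}{\sum_k W_{ik}}.$$
Then $P$ is the Markov transition matrix associated with $W$.
Now we define the entropy for a Markov transition matrix. First, let us consider the chain $X_0, X_1, \dots, X_N$ with length $N$, started from the stationary distribution $\pi$. Its entropy is
$$H(X_0, \dots, X_N) = -\sum_i \pi_i \log \pi_i - N \sum_{i,j} \pi_i P_{ij} \log P_{ij}.$$
So the entropy for a chain with unit length is defined as
$$H(P) = \lim_{N \to \infty} \frac{1}{N} H(X_0, \dots, X_N) = -\sum_{i,j} \pi_i P_{ij} \log P_{ij}.$$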
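The link between a transition matrix, its weight matrix and the entropy per unit length can be checked numerically. The sketch below assumes the usual conventions, which are not confirmed by the text: the weight matrix is the stationary distribution times the transition probabilities, and the entropy per step is $-\sum_{i,j} \pi_i P_{ij} \log P_{ij}$ with $0 \log 0 = 0$. The function names and the toy matrix are illustrative.

```python
import numpy as np


def stationary_distribution(P: np.ndarray) -> np.ndarray:
    """Left eigenvector of P for eigenvalue 1, normalized to sum to 1."""
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    index = np.argmin(np.abs(eigenvalues - 1.0))
    pi = np.real(eigenvectors[:, index])
    return pi / pi.sum()


def weight_matrix(P: np.ndarray) -> np.ndarray:
    """W[i, j] = pi[i] * P[i, j]."""
    return stationary_distribution(P)[:, None] * P


def entropy_rate(P: np.ndarray) -> float:
    """-sum_ij pi_i P_ij log P_ij, with 0 log 0 treated as 0."""
    W = weight_matrix(P)
    mask = P > 0
    return float(-np.sum(W[mask] * np.log(P[mask])))


# Toy two-state chain (invented for illustration).
P = np.array([[0.5, 0.5],
              [1.0, 0.0]])
W = weight_matrix(P)
```

For this chain the stationary distribution is $(2/3, 1/3)$, the entries of `W` sum to 1 with equal row and column sums, and the entropy per step is $(2/3) \log 2$.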
3.3 Maximum Entropy Theorem
The principle of maximum entropy is a basic principle in information theory (see e.g. [6]). It states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy. Since the distribution with the maximum entropy is the one that makes the fewest assumptions about the true distribution of the data, the principle of maximum entropy can be seen as an application of Occam’s razor (see e.g. [7]).
Theorem 3.1.
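Suppose the skeleton matrix $K$ is irreducible, $\lambda$ is the maximum eigenvalue of $K$, and $u$ and $v$ are the corresponding left and right eigenvectors with
$$\sum_i u_i v_i = 1.$$
Then the entropy of the Markov chain associated with the skeleton matrix $K$ attains its maximum when
$$W_{ij} = \frac{u_i K_{ij} v_j}{\lambda}.$$
Here, $W$ is the weight matrix for the transition matrix $P$ of the chain.
Proof.
See Appendix.
∎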
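The standard construction of the maximum entropy chain on a 0/1 matrix (often called the Parry measure) can be tried on a small example. The sketch below is written under the assumption that this standard result is what the theorem states: with Perron eigenvalue `lam` and left and right eigenvectors `u`, `v` scaled so that `u @ v == 1`, the weights are `u[i] * K[i, j] * v[j] / lam`. The function name and the toy matrix are illustrative, and `numpy.linalg.eig` is used here instead of the power method for brevity.

```python
import numpy as np


def max_entropy_chain(K: np.ndarray):
    """Return (lam, P, W) for an irreducible 0/1 matrix K.

    P[i, j] = K[i, j] * v[j] / (lam * v[i]) and
    W[i, j] = u[i] * K[i, j] * v[j] / lam,
    where u, v are left/right Perron eigenvectors scaled so that u @ v == 1.
    """
    eigenvalues, right = np.linalg.eig(K)
    top = np.argmax(eigenvalues.real)
    lam = eigenvalues[top].real
    v = np.abs(right[:, top].real)  # Perron vector, sign fixed to positive

    eigenvalues_left, left = np.linalg.eig(K.T)
    u = np.abs(left[:, np.argmax(eigenvalues_left.real)].real)
    u = u / (u @ v)  # enforce the normalization u . v = 1

    P = K * v[None, :] / (lam * v[:, None])
    W = u[:, None] * K * v[None, :] / lam
    return lam, P, W


# Toy skeleton matrix (invented): state 1 may not be followed by itself.
K = np.array([[1.0, 1.0],
              [1.0, 0.0]])
lam, P, W = max_entropy_chain(K)
mask = P > 0
entropy = float(-np.sum(W[mask] * np.log(P[mask])))
```

For this matrix `lam` is the golden ratio, the rows of `P` sum to 1, and the entropy per step equals `log(lam)`, the largest value any Markov chain supported on `K` can reach.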
Theorem 3.2.
Suppose is irreducible. is the maximum eigenvalue of , and are the corresponding left and right eigenvectors with
Then the entropy of the Markov chain associated with the skeleton matrix attains the maximum when
Here, is the weight matrix for .
Proof.
see Appendix.
∎
Remark 3.1.
Recall that we have assumed that and are surjective. Therefore, and have rank . Since , we have that
Therefore, the eigenvalues of and are the same except for the zeros. In particular, the largest eigenvalue of and are the same.
Moreover, . Suppose is an eigenvector of with eigenvalue , then
Since and is injective as a map from to , . Hence, is the eigenvector of with eigenvalue .
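The relation between the disease-level and the record-level matrices can be made explicit with a short calculation. The following is only a sketch in assumed notation, since the symbols of the original statements are not reproduced here: $f(k)$ and $s(k)$ are the indices of the first and second disease of record $k$, $F$ and $S$ are the $m \times n$ 0/1 matrices with $F_{k f(k)} = 1$ and $S_{k s(k)} = 1$, $A = F^T S$ is the $n \times n$ disease-level count matrix, $B = S F^T$ is the $m \times m$ record-level skeleton matrix, and $u$, $v$ are left and right eigenvectors of $A$ for its largest eigenvalue $\lambda$, normalized (by assumption) so that $u^T v = 1$.

```latex
\begin{align*}
BS &= S F^T S = S A, &
F^T B &= F^T S F^T = A F^T, \\
B (S v) &= S A v = \lambda\, S v, &
(F u)^T B &= u^T A F^T = \lambda\, (F u)^T, \\
(F u)^T (S v) &= u^T A v = \lambda\, u^T v = \lambda .
\end{align*}
% With \tilde u = F u / \lambda and \tilde v = S v we get \tilde u^T \tilde v = 1,
% so the standard maximum entropy weight matrix on the skeleton B reads
\[
W_{kl} = \frac{\tilde u_k \, B_{kl} \, \tilde v_l}{\lambda}
       = \frac{u_{f(k)} \, B_{kl} \, v_{s(l)}}{\lambda^{2}} .
\]
```

Under these assumptions only the eigenvectors of the small matrix $A$ are needed, even though the chain itself lives on the much larger set of records.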
3.4 Algorithm for Probability Estimation
Following is the algorithm for estimating the related probability.
3.5 Method for Disease Prediction
The prediction task is to predict the diseases that a person is most likely to have, given the diseases that he or she has already suffered from, ordered by occurrence.
We first calculate the transition probabilities by Algorithm 1. Then we construct from them a score for each candidate disease.
Then we sort the diseases by this score and choose the top 5 diseases as the predicted diseases for a person.
We also make an additional assumption, namely that the latest disease takes the highest weight, so some modifications are made. We first construct a decreasing sequence of weights. Then we modify the score so that more recent diseases in the history receive larger weights.
We then use the modified score to choose the top 5 diseases as the predicted diseases for a person.
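One plausible way to turn transition probabilities into a ranked prediction is sketched below. The exact score used in the paper is not reproduced in the text, so everything here is an assumption: the score of a candidate disease is taken to be a weighted sum of the transition probabilities from each disease in the history, and the recency weighting is taken to be geometric with a hypothetical `decay` parameter.

```python
import numpy as np


def predict_top_k(P, history, k=5, decay=None):
    """Rank candidate diseases given a patient's ordered history.

    P[i, j] is an estimated transition probability from disease i to j.
    history lists disease indices, oldest first.  decay, if given, is the
    ratio between the weights of consecutive history items (0 < decay <= 1),
    so that the latest disease gets the largest weight.
    """
    if decay is None:
        weights = np.ones(len(history))
    else:
        # The latest disease has exponent 0, i.e. the largest weight.
        exponents = np.arange(len(history) - 1, -1, -1)
        weights = float(decay) ** exponents
    weights = weights / weights.sum()

    scores = np.zeros(P.shape[0])
    for weight, disease in zip(weights, history):
        scores += weight * P[disease]
    return [int(j) for j in np.argsort(-scores)[:k]]


# Toy transition matrix (invented for illustration).
P = np.array([[0.0, 0.9, 0.1],
              [0.4, 0.0, 0.6],
              [0.5, 0.5, 0.0]])
```

With the history `[0, 1]`, the unweighted score ranks disease 1 first, while `decay=0.25` shifts the weight to the latest disease and ranks disease 2 first.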
4 Experiments
4.1 Data Cleaning
The diseases are classified by F-code as described in Section 2, and there are 429 F-coded diseases in total. If a patient suffered from the same disease more than once, we keep the earliest record and remove the others. For example, the patient with patient_id=123770 suffered from mucopurulent conjunctivitis on 20151209 and 20160308, so the record dated 20160308 is removed from the history.
We then collect the records of each patient into a single record. The history column stores the patient’s disease history, with the diseases sorted by time and separated by commas.
The following table is a sample of the cleaned data.
patient_id  history 

123770  F_H10,F_H00 
135086  F_M65,F_J00,F_K29,F_K01,F_K04 
400195  F_J00,F_K29 
3218331  F_J00,F_J40 
119151  F_J00,F_N60 
102519  F_J00,F_L50,F_E34,F_E01 
1503387  F_K29,F_K01,F_I83 
7044682  F_J00,F_J20,F_J40 
182660  F_E01,F_J00,F_J20 
1888934  F_K31,F_K29,F_K22,F_K50,F_J00,F_J40,F_J20,F_M70,F_J30,F_J34 
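The cleaning step described above can be sketched in a few lines. The function name and the tuple layout of the input are assumptions; the sample rows are taken from Table 1, deliberately shuffled to show that the result does not depend on input order.

```python
def build_histories(visits):
    """Collapse visit rows into one ordered, de-duplicated history per patient.

    visits: iterable of (patient_id, treatment_date, code) tuples, with dates
    as YYYYMMDD strings so that string order equals chronological order.
    """
    histories = {}
    for patient_id, _date, code in sorted(visits, key=lambda row: (row[0], row[1])):
        history = histories.setdefault(patient_id, [])
        if code not in history:  # keep only the earliest occurrence
            history.append(code)
    return {pid: ",".join(codes) for pid, codes in histories.items()}


visits = [
    (14532, "20111015", "F_M47"),
    (14532, "20130823", "F_L20"),
    (14532, "20121009", "F_L20"),
    (14532, "20111119", "F_N91"),
]
cleaned = build_histories(visits)
```

Here the second occurrence of F_L20 is dropped, so the history of patient 14532 becomes `F_M47,F_N91,F_L20`.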
4.2 Calculate the Probability
We construct the matrix as follows.
First we initialize a matrix with all entries equal to $0$. We also construct a map to index the diseases. Next, for each history, we update the entries of the matrix that correspond to its ordered disease pairs.
Thus, we establish the matrix.
Next, we use the power method to calculate the maximum eigenvalue of this matrix and the corresponding left and right eigenvectors.
After that, we derive the Markov weight matrix and the transition probabilities as described in Algorithm 1.
Finally, we calculate the prediction scores as described in subsection 3.5 and derive the related disease predictions.
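The pipeline of this subsection can be sketched end to end. Several details are assumptions rather than the paper's exact procedure: every ordered pair of diseases in a history is counted (not only consecutive ones), the power iteration is run on the matrix plus the identity so that it also converges for periodic matrices, and the transition probabilities are taken in the standard maximum entropy form `A[i, j] * v[j] / (lam * v[i])`. All function names are illustrative.

```python
import numpy as np


def count_matrix(histories, index):
    """A[i, j] = number of histories in which disease i precedes disease j."""
    A = np.zeros((len(index), len(index)))
    for history in histories:
        codes = [index[c] for c in history.split(",")]
        for a in range(len(codes)):
            for b in range(a + 1, len(codes)):
                A[codes[a], codes[b]] += 1
    return A


def power_method(M, iterations=10_000, tol=1e-13):
    """Perron eigenvalue and positive eigenvector of an irreducible M >= 0.

    Iterates with M + I, which has the same eigenvectors and is aperiodic,
    so the iteration also converges when M itself is periodic.
    """
    shifted = M + np.eye(M.shape[0])
    x = np.full(M.shape[0], 1.0 / M.shape[0])
    for _ in range(iterations):
        y = shifted @ x
        y /= y.sum()
        converged = np.max(np.abs(y - x)) < tol
        x = y
        if converged:
            break
    lam = (M @ x).sum() / x.sum()
    return lam, x


def transition_probabilities(A):
    """P[i, j] = A[i, j] * v[j] / (lam * v[i]); each row sums to 1."""
    lam, v = power_method(A)
    return A * v[None, :] / (lam * v[:, None])


# Toy histories (invented) over three of the F-codes from the sample table.
index = {"F_J00": 0, "F_K29": 1, "F_J40": 2}
histories = ["F_J00,F_K29", "F_J00,F_J40", "F_K29,F_J00,F_J40", "F_J40,F_K29"]
A = count_matrix(histories, index)
P = transition_probabilities(A)
```

The left eigenvector, needed for the weight matrix, can be obtained in the same way by calling `power_method(A.T)`.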
4.3 Results of Accuracy
To compare the results, we use a method previously used by the insurance company as the benchmark. This method, which we call the empirical method, calculates the incidence rate of each disease and uses the 5 most prevalent diseases as the prediction for every person.
We use 300,000 people’s records to calculate the transition probabilities and also the top 5 diseases. For another 10,000 people, we use their records from 2007 to 2014 to calculate the prediction score described in the previous section and choose the 5 diseases with the highest score as the prediction; we refer to this as the maximum entropy method.
Then we examine the diseases they suffer from during 2015 to 2017 to see how many diseases are correctly predicted by the two methods.
The measurement we use is called the hit rate. It is defined in terms of two sets:
A, the set of diseases predicted by the model, and B, the set of diseases that a person suffers from during 2015 to 2017.
If we predict 5 diseases using the maximum entropy method, the hit rate is 31.89%. In contrast, the hit rate is 16.55% for the empirical method.
We also compare the hit rates with 1, 2, 3 and 4 predictions for the two methods. The following table summarizes the results.
number of predictions  maximum entropy method  empirical method 

1  15.01%  7.54% 
2  20.50%  10.67% 
3  25.21%  13.55% 
4  29.01%  14.92% 
5  31.89%  16.55% 
We can see from the table that the hit rate of the maximum entropy method is approximately twice that of the empirical method.
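The exact formula of the hit rate is not reproduced in the text, so the following sketch rests on an assumed definition: for one person, the share of the diseases actually observed (set B) that appear among the predicted diseases (set A), averaged over all people. This reading is chosen only because it grows with the number of predictions, as the table does; other definitions are possible, and the function names are hypothetical.

```python
def hit_rate(predicted, actual):
    """Assumed definition: share of a person's actual diseases that were predicted."""
    actual = set(actual)
    if not actual:
        return 0.0
    return len(set(predicted) & actual) / len(actual)


def mean_hit_rate(predictions, outcomes):
    """Average the per-person hit rate over a cohort."""
    rates = [hit_rate(p, a) for p, a in zip(predictions, outcomes)]
    return sum(rates) / len(rates)
```

For example, predicting `["F_J00", "F_K29"]` for a person who later suffers from `["F_K29", "F_J40"]` gives a hit rate of 0.5 under this definition.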
4.4 Comorbidity Analysis
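We first study the ALRR. Recall that $ALRR_{ij}$ is calculated as follows.
$$ALRR_{ij} = \left| \log \frac{p_{ij}}{p_j} \right|,$$
where $p_{ij}$ is the transition probability from disease $i$ to disease $j$ and $p_j$ is the incidence probability of disease $j$.
If disease $i$ and disease $j$ are independent, then $ALRR_{ij}$ is close to $0$. So if $ALRR_{ij}$ is large, then disease $i$ and disease $j$ are highly correlated.
For example, if disease $i$ is high blood pressure and disease $j$ is type II diabetes, then, as listed in the table below, $ALRR_{ij} = 2.47$.
Here is a list of disease pairs with high ALRR.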
disease $i$  disease $j$  $ALRR_{ij}$

Type II diabetes  Type I diabetes  3.40
Pulmonary heart disease  Acute ischemic heart disease  3.35
Type II diabetes  atherosclerosis  3.18
Diseases of lip, tongue and oral mucosa  [missing]  2.98
heart failure  Arrhythmia  2.96
Metabolic disorders  renal failure  2.84
emphysema  asthma  2.80
pneumonia  Bronchiectasis  2.67
high blood pressure  Type II diabetes  2.47
high blood pressure  renal failure  2.46
alopecia  Seborrheic keratosis  2.41
heart failure  Anal and rectal disorders  2.41
Type II diabetes  high blood pressure  2.39
heart failure  Peptic ulcer  2.38
high blood pressure  atherosclerosis  2.36
Alzheimer disease  Sleep disorders  2.33
high blood pressure  [missing]  2.33
Over nutrition  Other diabetes  2.27
Pulmonary heart disease  Arrhythmia  2.22
high blood pressure  heart failure  2.21
Pituitary hyperfunction  Joint disorder  2.12
Pulmonary heart disease  arthritis  2.08
The following table lists disease pairs for which $ALRR_{ij}$ differs from $ALRR_{ji}$.
disease $i$  disease $j$  $ALRR_{ij}$  $ALRR_{ji}$

Female pelvic inflammatory disease  Trichomoniasis  1.13  3.40
nephritic nephrotic syndrome  heart failure  1.51  3.35
Metabolic disorders  Malignant tumor of skin  1.37  3.19
Esophageal diseases  Splenic diseases  3.40  1.58
anemia  hypotension  2.89  1.30
Other diabetes  Central nervous system diseases  1.26  2.84
Benign tumor of uterus  [missing]  2.89  1.31
Arrhythmia  Mental and behavioral disorders  2.58  1.03
Arthrosis  epilepsy  0.71  1.89
Arrhythmia  Diseases of autonomic nervous system  2.63  1.47
Headache syndrome  Other diseases of arteries and arterioles  1.58  2.74
[missing]  Type II diabetes  2.78  1.70
asthma  emphysema  1.73  2.80
Ankylosis and other spondylosis  Hypopituitarism  1.87  0.80
Arthrosis  Myasthenia and primary muscle diseases  2.71  1.65
Malignant tumors of digestive organs  Hemangioma and lymphangioma  3.40  2.33
Refractive and accommodative disorders  glaucoma  2.41  1.35
Other diabetes  Optic neuropathy  1.71  2.74
Chronic ischemic heart disease  Pericardial disease  1.26  2.28
Type II diabetes  Over nutrition  0.82  1.83
Next, we consider the $\phi$-correlation defined in Section 2.2.
The next table displays 20 disease pairs with high $\phi$-correlation.
disease $i$  disease $j$  $\phi_{ij}$

sleep disorder  68.23
Type II diabetes  Hypertension and its complications  67.75
Headache syndrome  [missing]  60.72
Arrhythmia  Hypertension and its complications  58.55
Muscle disorders  Backache  57.28
Shingles  Dermatitis and pruritus  55.30
Headache syndrome  Backache  53.33
Benign uterine tumor  Abnormal uterine and vaginal bleeding  46.75
[missing]  [missing]  42.54
Other disorders of kidney and ureter  Other disorders of the urinary system  42.30
anemia  [missing]  40.85
cellulitis  [missing]  40.74
Other disorders of male reproductive organs  Prostatic hyperplasia and prostatitis  40.69
Urethral disorders  Other disorders of the urinary system  39.66
Other disorders of bone  Osteoporosis without pathological fracture  34.96
[missing]  Chronic bronchitis  33.39
Other diseases of the digestive system  Gastritis and duodenitis  29.68
Type II diabetes  Metabolic disorders  27.51
Type II diabetes  Dermatitis and pruritus  27.16
Arrhythmia  sleep disorder  27.11
We can see from Table 5 that many disease pairs have a large $\phi$-correlation, such as type II diabetes and hypertension and its complications, which implies that such diseases are intrinsically related. Tedesco et al. [3] note that hypertension is frequently associated with diabetes mellitus and that its prevalence doubles in diabetics compared to the general population. This high prevalence is associated with increased stiffness of large arteries. Our result is consistent with their medical research.
5 Conclusions
In this paper, we propose a maximum entropy method for predicting disease risks. It is based on a patient’s medical history with diseases coded in ICD-10, so it can be used in a wide range of settings. The complete algorithm is given together with a rigorous mathematical derivation. We also present experimental results on a medical dataset, which show that our method achieves a hit rate roughly twice that of the empirical method used as the benchmark. Finally, we perform a comorbidity analysis to reveal intrinsic relations between diseases.
Acknowledgement
We would like to thank Franco Mueller, Jonathan Brezin and Matt Grayson for their collaboration on an early version of this research.
Appendix
Proof of Theorem 3.1.
Suppose is the weight matrix. Then the entropy of the Markov chain can be rewritten as
Let us construct the Lagrangian
If , we have that
If , then . Therefore,
Set , then
By the Perron-Frobenius theorem (see e.g. [11]), an irreducible nonnegative matrix has no nonnegative eigenvectors other than the Perron vector and its positive multiples. Hence,
And
On the other hand,
Therefore, there exists such that
Hence,
Recall that
and . It follows that
Since , ,
∎
Next, we will prove Theorem 3.2. We first prove an auxiliary lemma.
Lemma 1.
Suppose is the maximum eigenvalue of and are matrices in section 3.1. are the corresponding left and right eigenvectors with
Then
(1) 
Proof.
Proof of Theorem 3.2.
Suppose is the weight matrix of , then
By the above lemma, we have that
Set
Then are the left and right eigenvectors of corresponding to the eigenvalue . And
Hence,
∎
References
[1] Glasgow R.E. et al., Does the chronic care model serve also as a template for improving prevention, Milbank Q, 2001, 79(4):579-612.
[2] Cesar A. Hidalgo, Nicholas Blumm, Albert-Laszlo Barabasi, Nicholas A. Christakis, A Dynamic Network Approach for the Study of Human Phenotypes, PLoS Computational Biology, 2009, 5, e1000353.
[3] M. A. Tedesco, F. Natale, G. Di Salvo, S. Caputo, M. Capasso & R. Calabro, Effects of coexisting hypertension and type II diabetes mellitus on arterial stiffness, Journal of Human Hypertension, 2004, 18:469-473.
[4] Roque, F. S. et al., Using electronic patient records to discover disease correlations and stratify patient cohorts, PLoS Computational Biology, 2011, 7, e1002141.
[5] Melton, G. B. et al., Inter-patient distance metrics using SNOMED CT defining relationships, J. Biomed. Inform., 2006, 39:697-705.
[6] Shannon, C.E., A Mathematical Theory of Communication, Bell System Technical Journal, 1948, 27:379-423.
[7] Maurer, A., Ockham’s Razor and Chatton’s Anti-Razor, Mediaeval Studies, 1984, 46:463-475.
[8] Perlis, R. H. et al., Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model, Psychol. Med., 2012, 42:41-50.
[9] Thomas E. Booth, Power Iteration Method for the Several Largest Eigenvalues and Eigenfunctions, Nuclear Science and Engineering, 2006, 154(1):48-62.
[10] Bruce Kitchens, Symbolic dynamics: one-sided, two-sided, and countable state Markov shifts, 1998, Springer.
[11] Carl D. Meyer, Matrix analysis and applied linear algebra. With solutions to problems, 2001, SIAM.