Disease prediction is an effective way to assess a person's health status. Studies have shown that, in many cases, identifiable indicators or preventable risk factors are present before the onset of a disease. Acting on these early warnings can reduce an individual's risk of disease and, in principle, reduce the number of treatments needed while increasing the effectiveness of the interventions that remain necessary. However, the combinations of risk factors associated with different diseases and with a patient's past medical history are so complicated that no doctor can fully keep track of them. Currently, doctors use family and health history and physical examinations to estimate a patient's risk and to guide laboratory tests that further evaluate the patient's health. These sporadic and qualitative "risk assessments" usually cover only a few diseases and depend on the experience, memory, and time of the particular doctor. As a result, current medical care is largely reactive: it intervenes once the symptoms of a disease appear, rather than treating or eliminating the disease as early as possible.
Today the prevailing model of prospective health care is firmly based on the genome revolution. Indeed, technologies ranging from linkage equilibrium and candidate gene association studies to genome-wide association studies have provided an extensive list of disease-gene associations, offering detailed information on mutations, SNPs, and the associated likelihood of developing specific disease phenotypes.
The basic assumption behind the research is that once we have classified all disease-related mutations, we can use various molecular biomarkers to predict each individual’s susceptibility to future diseases, thus bringing us into a predictive medicine era. However, these rapid advances have also revealed the limitations of genome-based methods. Considering that the signals provided by most disease-related SNPs or mutations are very weak, it is becoming increasingly clear that the prospect of genome-based methods may not be realized soon.
Does this mean that prospective disease prediction must wait until genomics methods are sufficiently mature? Our purpose is to show that methods based on medical history offer a promising route to the prospective prediction of disease.
In this paper, we study disease prediction and disease comorbidity. Our approach is distinctive in that we try to build a general predictive system that can use a less constrained feature space, i.e., one that takes into account all available demographics and previous medical history. Moreover, to account for the previous medical history we rely primarily on ICD-10-CM (International Classification of Diseases, Tenth Revision, Clinical Modification) codes (see Section 2) rather than on specialized test results.
2.1 Source Data and Population
Our database comprises the medical records of 354,552 patients in China, covering a total of 2,904,257 hospital visits. The data were originally compiled from insurance claims filed between 2007 and 2017. Such medical records are highly complete and accurate, and they are frequently used for epidemiological and demographic research.
The input for our methods consists of each patient's personal information, such as gender, date of birth, treatment date, and diagnosis history, provided per visit. Each data record corresponds to one hospital visit and is represented by a patient ID and a diagnosis code, as defined by the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). The International Statistical Classification of Diseases and Related Health Problems (ICD) provides codes to classify diseases and a wide variety of signs, symptoms, abnormal findings, social circumstances, and external causes of injury or disease. It is published by the World Health Organization.
Each disease or health condition is given a unique code of up to 6 characters, such as A01.001. The first character is a letter and the others are digits. ICD-10 codes are hierarchical, so a 6-character code can be collapsed to fewer characters that identify a small family of related medical conditions. For instance, A01.001 is a specific code for typhoid fever, and it can be collapsed to A01.
Moreover, we group diseases of the same category into one class. For example, A90 is the code for dengue fever (classical dengue) and A91 is the code for dengue hemorrhagic fever; we assign both to the same class, named F_A90. In this way, the roughly 20 thousand original ICD-10 codes are grouped into 429 classes.
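The collapsing and grouping steps can be illustrated with a minimal Python sketch. The function names, the lookup table `CATEGORY_TO_CLASS`, and the fallback rule for unlisted categories are hypothetical and only serve the illustration; the actual mapping to the 429 classes is not reproduced here.

```python
import re

# Hypothetical lookup from 3-character categories to class labels.
# Only the dengue example from the text is filled in.
CATEGORY_TO_CLASS = {"A90": "F_A90", "A91": "F_A90"}

# One letter, two digits, then optionally a dot and up to three digits.
_ICD_PATTERN = re.compile(r"^[A-Z]\d{2}(\.\d{1,3})?$")


def collapse_icd(code: str) -> str:
    """Collapse a full code such as 'A01.001' to its 3-character category."""
    code = code.strip().upper()
    if not _ICD_PATTERN.match(code):
        raise ValueError(f"unexpected ICD-10 code format: {code!r}")
    return code.split(".")[0]


def disease_class(code: str) -> str:
    """Map a code to its class label.

    Falling back to 'F_' + category for unlisted categories is an
    assumption made for this sketch.
    """
    category = collapse_icd(code)
    return CATEGORY_TO_CLASS.get(category, "F_" + category)
```

For instance, `collapse_icd("A01.001")` yields `"A01"`, and both `disease_class("A90")` and `disease_class("A91")` yield `"F_A90"`.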
A sample patient medical history is shown in Table 1. Each line represents one hospital visit. Demographic data are also available.
In our medical database, the number of visits per patient ranges from 1 to 491, with a median of 4 and a mean of 8.19. Table 2 shows the 20 most prevalent diseases in our database.
|Disease||Percentage|
|Acute upper respiratory infection||20.88%|
|Hypertension and its complications||7.35%|
|Dermatitis and pruritus||3.75%|
|Gastritis and duodenitis||3.49%|
|Pulp, gum, and alveolar ridge diseases||3.15%|
|Hard tissue disease of teeth||2.73%|
|Abnormal uterine and vaginal bleeding||2.42%|
|Chronic rhinitis, nasopharyngitis and pharyngitis||1.99%|
|Non-infectious gastroenteritis and colitis||1.97%|
|Chronic ischemic heart disease||1.93%|
|Inflammation of the vagina and vulva||1.72%|
|Abnormal thyroid (parathyroid) function||1.62%|
|Acute lower respiratory infection||1.51%|
|Cervical disc disease||1.44%|
|Type II diabetes||1.37%|
|Female pelvic inflammatory disease||1.18%|
2.2 Quantifying the Strength of Comorbidity Relationships
To measure the correlations that arise from disease comorbidity, we need to quantify the strength of comorbidity by introducing a notion of distance between two diseases. One difficulty is that different statistical measures carry biases that overestimate or underestimate the relationships involving rare or highly prevalent diseases. These biases matter because the number of diagnoses (prevalence) of a disease follows a long-tailed distribution: most diseases are rarely diagnosed, while a small number of diseases are diagnosed in a large part of the population.
Therefore, quantifying comorbidity usually requires us to compare diseases that affect dozens of patients with diseases that affect millions of patients.
We use two comorbidity measures to quantify the distance between two diseases: the Absolute Logarithmic Relative Risk (ALRR) and the $\phi$-correlation ($\phi$).
The empirical Relative Risk of observing a pair of diseases $i$ and $j$ affecting the same patient is given by
$$RR_{ij} = \frac{C_{ij} N}{P_i P_j},$$
where $C_{ij}$ is the number of patients affected by both diseases, $N$ is the total number of patients in the population, and $P_i$ and $P_j$ are the prevalences of diseases $i$ and $j$.
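Thus, the Relative Risk is defined as
$$RR_{ij} = \frac{p_{ij}}{p_j},$$
where $p_{ij}$ is the transition probability from disease $i$ to disease $j$ and $p_j$ is the incidence probability of disease $j$. The Absolute Logarithmic Relative Risk is defined as
$$ALRR_{ij} = \left| \log RR_{ij} \right|.$$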
The $\phi$-correlation, which is Pearson's correlation for binary variables, can be expressed mathematically as
Therefore, the $\phi$-correlation is defined as
These two comorbidity measures are not completely independent of each other, as they both increase with the number of patients affected by both diseases, yet each has its intrinsic biases. For example, RR overestimates relationships involving rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas $\phi$ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases.
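As an illustration, both measures can be computed from four counts. The sketch below assumes the standard count-based forms of the relative risk and of Pearson's correlation for two binary variables, with co-occurrence count `c_ij`, prevalences `p_i` and `p_j`, and population size `n`. The function names and the use of the natural logarithm are assumptions of this sketch, and the values reported in the tables may be scaled differently.

```python
import math


def relative_risk(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Observed co-occurrence divided by the co-occurrence expected
    under independence: RR = C_ij * N / (P_i * P_j)."""
    return c_ij * n / (p_i * p_j)


def alrr(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Absolute logarithmic relative risk, |log RR|.

    The natural logarithm is an assumption; requires c_ij > 0.
    """
    return abs(math.log(relative_risk(c_ij, p_i, p_j, n)))


def phi_correlation(c_ij: int, p_i: int, p_j: int, n: int) -> float:
    """Pearson correlation of two binary indicator variables,
    written in terms of counts."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator
```

When the two diseases co-occur exactly as often as independence predicts, the relative risk is 1, so both the ALRR and $\phi$ vanish; when the two diseases always appear together, $\phi$ equals 1.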
In this section, we formulate the maximum entropy method that we use to predict disease risk.
3.1 Some notations
Suppose there are diseases and records. Let us use to denote disease . A record is a pair of diseases which means that there is a patient with a diagnosis of disease simultaneously or after disease . Let us use to denote record ().
Assume that . Here, and are maps.
is called the first disease and is called the second disease in record .
In this paper, we assume that and are surjective. If is not surjective, we can remove the diseases with indexes in from the medical history. Then the surjective assumption can be satisfied for . The same can be done for .
Let . Then
is the number of patients who suffer from disease before disease .
is a matrix with entries ranged in . if and only if there is a disease such that the patient in record suffer a second disease and the patient in record suffer a first disease . Our task is to evaluate the transition probability from record to record .
3.2 Entropy for Markov Chains
Suppose is a non-negative matrix. If for each , there exists such that , then is said to be irreducible.
Now we define the entropy for Markov chains. A matrix is called a skeleton matrix if its entries are either $0$ or $1$. A non-negative matrix is called a Markov transition matrix if
Moreover, if , then is called the Markov transition matrix associated with the skeleton matrix .
For a non-negative vector, if
then is called a stationary distribution of .
For a non-negative matrix , if
and for all ,
Then is called a Markov weight matrix.
Here are some connections between Markov transition matrix and Markov weight matrix.
For a Markov transition matrix with stationary distribution . Define
Then it is easy to verify that is a Markov weight matrix.
Conversely, given a Markov weight matrix , set
Then is the Markov weight matrix associated with .
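The correspondence between transition matrices and weight matrices can be checked numerically. The sketch below assumes the usual conventions: a weight matrix `W` has entries `W[i][j] = pi[i] * P[i][j]`, where `pi` is the stationary distribution of `P`, and the transition matrix is recovered by normalizing each row of `W`. These names and conventions are assumptions of the sketch, and the stationary distribution is obtained by simple repeated multiplication, which presumes an irreducible, aperiodic chain.

```python
def stationary_distribution(P, iters=10_000, tol=1e-13):
    """Approximate pi with pi = pi P by repeated left multiplication."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        done = max(abs(a - b) for a, b in zip(nxt, pi)) < tol
        pi = nxt
        if done:
            break
    return pi


def weight_from_transition(P, pi):
    """W[i][j] = pi[i] * P[i][j]; the entries of W sum to 1."""
    n = len(P)
    return [[pi[i] * P[i][j] for j in range(n)] for i in range(n)]


def transition_from_weight(W):
    """P[i][j] = W[i][j] / sum_k W[i][k]; assumes every row sum is positive."""
    row_sums = [sum(row) for row in W]
    return [[w / row_sums[i] for w in row] for i, row in enumerate(W)]
```

For a weight matrix built this way, the row sums and the column sums both equal `pi`, and `transition_from_weight` returns the original transition matrix.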
Now we define the entropy for a Markov transition matrix. First, let us consider the chain with length .
So the entropy for a chain with unit length is defined as
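For a concrete reading of this definition, the sketch below evaluates the per-step entropy of a Markov chain from its weight matrix, assuming the standard form $H = -\sum_{i,j} W_{ij} \log\left(W_{ij} / \sum_k W_{ik}\right)$, which equals $-\sum_{i,j} \pi_i P_{ij} \log P_{ij}$. The function name and the choice of the natural logarithm are assumptions.

```python
import math


def markov_entropy(W):
    """Per-step entropy of the chain described by weight matrix W.

    Each term is W_ij * log(P_ij) with P_ij = W_ij / (row sum of W).
    Zero entries contribute nothing.
    """
    h = 0.0
    for row in W:
        row_sum = sum(row)
        for w in row:
            if w > 0.0:
                h -= w * math.log(w / row_sum)
    return h
```

A chain that picks the next state uniformly between two states has entropy $\log 2$, whereas a deterministic two-cycle has entropy zero.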
3.3 Maximum Entropy Theorem
The principle of maximum entropy is a basic principle in information theory (see, e.g., Shannon). It states that the probability distribution that best represents the current state of knowledge is the one with the largest entropy. Since the distribution with maximum entropy is the one that makes the fewest assumptions about the true distribution of the data, the principle of maximum entropy can be seen as an application of Occam's razor (see, e.g., Maurer).
Suppose is irreducible. is the maximum eigenvalue of , and are the corresponding left and right eigenvectors with
Then the entropy of the Markov chain associated with the skeleton matrix attains the maximum when
Here, is the weight matrix for .
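The maximizing chain can be written down explicitly once the dominant eigenvalue and eigenvectors are known. The sketch below assumes the classical construction for an irreducible non-negative matrix `A` with dominant eigenvalue `lam`, left eigenvector `u`, and right eigenvector `v`: the transition matrix is `P[i][j] = A[i][j] * v[j] / (lam * v[i])` and the weight matrix is `W[i][j] = u[i] * A[i][j] * v[j] / (lam * (u . v))`. The function name is hypothetical and the eigen-data are taken as inputs.

```python
import math


def max_entropy_chain(A, lam, u, v):
    """Return (P, W) for the maximum-entropy chain supported on A.

    A   : irreducible non-negative square matrix (list of lists)
    lam : its largest eigenvalue
    u, v: corresponding left and right eigenvectors (positive entries)
    """
    n = len(A)
    norm = sum(u[i] * v[i] for i in range(n))  # rescales so that u . v = 1
    P = [[A[i][j] * v[j] / (lam * v[i]) for j in range(n)] for i in range(n)]
    W = [[u[i] * A[i][j] * v[j] / (lam * norm) for j in range(n)]
         for i in range(n)]
    return P, W


# Example: skeleton matrix [[1, 1], [1, 0]], whose largest eigenvalue is the
# golden ratio with eigenvector (golden, 1) on both sides (A is symmetric).
golden = (1 + math.sqrt(5)) / 2
P_example, W_example = max_entropy_chain(
    [[1, 1], [1, 0]], golden, [golden, 1.0], [golden, 1.0]
)
```

In this example each row of `P_example` sums to 1, the entries of `W_example` sum to 1, and the entropy of the resulting chain equals $\log \lambda$, the logarithm of the largest eigenvalue.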
Recall that we have assumed that and are surjective. Therefore, and have rank . Since , we have that
Therefore, the eigenvalues of and are the same except for the zeros. In particular, the largest eigenvalue of and are the same.
Moreover, . Suppose is an eigenvector of with eigenvalue , then
Since and is injective as a map from to , . Hence, is the eigenvector of with eigenvalue .
3.4 Algorithm for Probability Estimation
Following is the algorithm for estimating the related probability.
3.5 Method for Disease Prediction
The prediction task is to predict the diseases that a person is most likely to develop, given that they have already suffered from diseases which are ordered by occurrence.
We first calculate the probability by Algorithm 1. Then we construct the following quantity
Then we sort the and choose the top 5 diseases as the predicted diseases for a person.
We also make an additional assumption: the most recent disease carries the highest weight, so some modifications are made. We first construct a decreasing sequence such that . (For example, ). Then we modify as
We then use the modified to choose the top 5 diseases as the predicted diseases for a person.
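One way to realize such a recency-weighted ranking is sketched below. The score of a candidate disease is taken to be a weighted sum of the transition probabilities from each disease in the history, with geometrically decreasing weights normalized to sum to 1 so that the most recent disease counts most. The function name, the `decay` parameter, the geometric weights, and the dictionary layout of `P` are all assumptions of this sketch rather than the exact quantities defined above.

```python
def predict_top_k(P, history, k=5, decay=0.5):
    """Rank candidate diseases for a patient.

    P       : dict of dicts, P[a][b] = estimated probability of b after a
    history : diseases ordered from oldest to most recent
    k       : number of predictions to return
    decay   : ratio between consecutive weights (assumed geometric)
    """
    # Weight for the disease seen `age` steps ago; age 0 is the latest.
    raw = [decay ** age for age in range(len(history))]
    total = sum(raw)
    scores = {}
    for age, disease in enumerate(reversed(history)):
        weight = raw[age] / total
        for nxt, prob in P.get(disease, {}).items():
            scores[nxt] = scores.get(nxt, 0.0) + weight * prob
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [disease for disease, _ in ranked[:k]]
```

With `decay=1.0` all diseases in the history are weighted equally, which corresponds to the unmodified ranking; smaller values of `decay` emphasize the latest diagnosis more strongly.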
4.1 Data Cleaning
The diseases are classified by F-code as described in Section 2, and there are 429 F-coded diseases in total. If a patient suffered from the same disease several times, we keep the earliest record and remove the others. For example, the patient with patientid=123770 suffered from mucopurulent conjunctivitis on 2015-12-09 and 2016-03-08, so the record dated 2016-03-08 is removed from the history.
We then collect the cleaned records of each patient into a single record. The history column records the patient's disease history, with the diseases sorted by time and separated by commas.
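The cleaning step can be sketched in a few lines of Python. The input layout (tuples of patient ID, visit date, and disease class), the function name, and the class labels used in the example are hypothetical.

```python
from collections import defaultdict
from datetime import date


def clean_histories(visits):
    """Build one comma-separated history string per patient.

    visits: iterable of (patient_id, visit_date, disease_class) tuples.
    For repeated diagnoses of the same disease only the earliest date is
    kept; diseases are then ordered by that date.
    """
    first_seen = defaultdict(dict)
    for patient_id, day, disease in visits:
        previous = first_seen[patient_id].get(disease)
        if previous is None or day < previous:
            first_seen[patient_id][disease] = day
    histories = {}
    for patient_id, seen in first_seen.items():
        # Sort by first occurrence; ties on the same day fall back to the label.
        ordered = sorted(seen.items(), key=lambda kv: (kv[1], kv[0]))
        histories[patient_id] = ",".join(disease for disease, _ in ordered)
    return histories


# Hypothetical labels: "F_H10" stands in for the conjunctivitis class.
example_visits = [
    (123770, date(2016, 3, 8), "F_H10"),
    (123770, date(2015, 12, 9), "F_H10"),
    (123770, date(2016, 1, 5), "F_J00"),
]
example_histories = clean_histories(example_visits)
```

In this example the later conjunctivitis visit is dropped and the remaining diseases are listed in order of first occurrence.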
The following table is a sample of the cleaned data.
4.2 Calculate the Probability
We construct the matrix as follows.
First we initialize a matrix with entries equal to . We also construct a map to index the diseases. Next, for history , we set
Thus, we establish the matrix .
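A possible construction is sketched below. It assumes that the matrix counts, for every ordered pair of diseases, how many patient histories contain the first disease before the second, and that all ordered pairs within a history are counted rather than only consecutive ones. Both assumptions, as well as the function name, are choices of this sketch.

```python
from itertools import combinations


def build_count_matrix(histories):
    """Count ordered disease pairs over all histories.

    histories: list of disease lists, each ordered by time and free of
    duplicates. Returns (A, index), where index maps a disease to its
    row/column and A[i][j] counts the histories in which disease i
    appears before disease j.
    """
    diseases = sorted({d for history in histories for d in history})
    index = {d: k for k, d in enumerate(diseases)}
    n = len(diseases)
    A = [[0] * n for _ in range(n)]
    for history in histories:
        # combinations() keeps the original order, so each pair is
        # (earlier, later).
        for earlier, later in combinations(history, 2):
            A[index[earlier]][index[later]] += 1
    return A, index
```

For the two histories `["a", "b", "c"]` and `["a", "c"]`, the pair (a, c) is counted twice and the pairs (a, b) and (b, c) once each.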
Next, we use the power method to calculate the maximum eigenvalue of and the corresponding left and right eigenvectors and .
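A plain power iteration suffices for this step. The sketch below is written in pure Python for small matrices and assumes that the matrix is non-negative and primitive (irreducible and aperiodic), so that the iteration converges to the dominant eigenpair. The function names, iteration limits, and normalizations are assumptions; the left eigenvector is obtained by iterating on the transpose and is rescaled so that its inner product with the right eigenvector equals 1.

```python
def power_iteration(A, iters=10_000, tol=1e-13):
    """Dominant eigenvalue and right eigenvector of a non-negative matrix."""
    n = len(A)
    v = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        # v is scaled to have largest entry 1, so max(A v) estimates lambda.
        new_lam = max(w)
        if new_lam <= 0.0:
            raise ValueError("iteration collapsed; matrix must be non-negative and non-zero")
        w = [x / new_lam for x in w]
        converged = abs(new_lam - lam) < tol
        v, lam = w, new_lam
        if converged:
            break
    return lam, v


def dominant_eigen(A):
    """Return (lam, u, v) with u A = lam u, A v = lam v and u . v = 1."""
    n = len(A)
    lam, v = power_iteration(A)
    transposed = [[A[j][i] for j in range(n)] for i in range(n)]
    _, u = power_iteration(transposed)
    scale = sum(a * b for a, b in zip(u, v))
    u = [a / scale for a in u]
    return lam, u, v
```

For periodic matrices the plain iteration can oscillate instead of converging; a shifted or library-based eigen-solver would be a more robust choice in that case.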
After that, we derive the Markov weight matrix and the transition probability as described in Algorithm 1.
Finally, we can calculate as described in subsection 3.5 and derive the related disease prediction.
4.3 Results of Accuracy
For comparison, we use as the benchmark a method previously used by the insurance company. This method, which we call the empirical method, calculates the incidence rate of each disease and uses the 5 most prevalent diseases as the prediction for every person.
We use the records of 300,000 people to calculate and also the top 5 diseases. For another 10,000 people, we use their records from 2007-2014 to calculate the diseases with the highest , as described in the previous section, and choose the 5 diseases with the highest -score as the prediction; this is the maximum entropy method.
Then we examine the diseases they suffered from during 2015-2017 to see how many diseases are accurately predicted by each of the two methods.
The measurement we use is called the hit rate. It is defined as follows.
where A is the set of diseases predicted by the model and B is the set of diseases that a person suffers from during 2015-2017.
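A hit-rate computation of this kind can be sketched as follows. The sketch assumes that the per-person hit rate is the fraction of predicted diseases that actually occur, i.e. the size of the intersection of A and B divided by the size of A, and that the reported figure is the average over all evaluated people. Both the normalization and the function names are assumptions.

```python
def hit_rate(predicted, actual):
    """Fraction of predicted diseases (set A) found among the diseases
    that actually occurred (set B)."""
    a, b = set(predicted), set(actual)
    if not a:
        return 0.0
    return len(a & b) / len(a)


def mean_hit_rate(pairs):
    """Average hit rate over (predicted, actual) pairs, one per person."""
    pairs = list(pairs)
    return sum(hit_rate(a, b) for a, b in pairs) / len(pairs)
```

For example, if two of five predicted diseases occur during the evaluation period, the hit rate for that person is 0.4.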
If we predict 5 diseases using the maximum entropy method, the hit rate is 31.89%. By contrast, the hit rate is 16.55% for the empirical method.
We also compare the hit rates of the two methods with 1, 2, 3, and 4 predictions. The following table summarizes the results.
|number of predictions||maximum entropy method||empirical method|
We can see from the table that the hit rate of the maximum entropy method is approximately twice that of the empirical method.
4.4 Comorbidity Analysis
We first study the ALRR. Recall that is calculated as follows.
If disease $i$ and disease $j$ are independent, then $ALRR_{ij}$ is close to $0$. So if $ALRR_{ij}$ is large, then disease $i$ and disease $j$ are highly correlated.
If disease $i$ is high blood pressure and disease $j$ is type II diabetes, then
Here is a list of disease pairs with high ALRR.
|Type II diabetes||Type I diabetes||3.40|
|Pulmonary heart disease||Acute ischemic heart disease||3.35|
|Type II diabetes||atherosclerosis||3.18|
|Diseases of lip, tongue and oral mucosa||
|Metabolic disorders||renal failure||2.84|
|high blood pressure||Type II diabetes||2.47|
|high blood pressure||renal failure||2.46|
|heart failure||Anal and rectal disorders||2.41|
|Type II diabetes||high blood pressure||2.39|
|heart failure||Peptic ulcer||2.38|
|high blood pressure||atherosclerosis||2.36|
|Alzheimer disease||Sleep disorders||2.33|
|high blood pressure||
|Over nutrition||Other diabetes||2.27|
|Pulmonary heart disease||Arrhythmia||2.22|
|high blood pressure||heart failure||2.21|
|Pituitary hyperfunction||Joint disorder||2.12|
|Pulmonary heart disease||arthritis||2.08|
The following table is a list of diseases such that differs from .
|Female pelvic inflammatory disease||Trichomoniasis||1.13||3.40|
|nephritic nephrotic syndrome||heart failure||1.51||3.35|
|Metabolic disorders||Malignant tumor of skin||1.37||3.19|
|Esophageal diseases||Splenic diseases||3.40||1.58|
|Other diabetes||Central nervous system diseases||1.26||2.84|
|Benign tumor of uterus||
|Arrhythmia||Mental and behavioral disorders||2.58||1.03|
|Arrhythmia||Diseases of autonomic nervous system||2.63||1.47|
|Headache syndrome||Other diseases of arteries and arterioles||1.58||2.74|
|Type II diabetes||2.78||1.70|
|Ankylosis and other spondylosis||Hypopituitarism||1.87||0.80|
|Arthrosis||Myasthenia and primary muscle diseases||2.71||1.65|
|Malignant tumors of digestive organs||Hemangioma and lymphangioma||3.40||2.33|
|Refractive and accommodative disorders||glaucoma||2.41||1.35|
|Other diabetes||Optic neuropathy||1.71||2.74|
|Chronic ischemic heart disease||Pericardial disease||1.26||2.28|
|Type II diabetes||Over nutrition||0.82||1.83|
Next, we consider the $\phi$-correlation. Recall that
The next table displays 20 disease pairs with high $\phi$-correlation.
|Type II diabetes||Hypertension and its complications||67.75|
|Arrhythmia||Hypertension and its complications||58.55|
|Shingles||Dermatitis and pruritus||55.30|
|Benign uterine tumor||Abnormal uterine and vaginal bleeding||46.75|
|Other disorders of kidney and ureter||Other disorders of the urinary system||42.30|
|Other disorders of male reproductive organs||Prostatic hyperplasia and prostatitis||40.69|
|Urethral disorders||Other disorders of the urinary system||39.66|
|Other disorders of bone||Osteoporosis without pathological fracture||34.96|
|Other diseases of the digestive system||Gastritis and duodenitis||29.68|
|Type II diabetes||Metabolic disorders||27.51|
|Type II diabetes||Dermatitis and pruritus||27.16|
We can see from Table 5 that many disease pairs have a large $\phi$-correlation, such as type II diabetes and hypertension and its complications, which implies that such diseases are intrinsically related. Tedesco et al. noted that hypertension is frequently associated with diabetes mellitus and that its prevalence doubles in diabetics compared to the general population. This high prevalence is associated with increased stiffness of large arteries. Our result is consistent with their medical research.
In this paper, we propose a maximum entropy method for predicting disease risks. It is based on a patient's medical history with diseases coded in ICD-10, which makes it applicable in a variety of settings. The complete algorithm is given together with a rigorous mathematical derivation. We also present experimental results on a medical dataset, demonstrating that our method achieves a hit rate approximately twice that of the empirical benchmark when predicting future diseases. Finally, we perform a comorbidity analysis to reveal intrinsic relations between diseases.
We would like to thank Franco Mueller, Jonathan Brezin and Matt Grayson for their collaboration on an early version of this research.
Proof of Theorem 3.1.
Suppose is the weight matrix. Then the entropy of the Markov chain can be rewritten as
Let us construct the Lagrangian
If , we have that
If , then . Therefore,
Set , then
By the Perron-Frobenius theorem (see, e.g., Meyer), there are no nonnegative eigenvectors for other than the Perron vector and its positive multiples. Hence,
On the other hand,
Therefore, there exists such that
and . It follows that
Since , ,
Next, we will prove Theorem 3.2. We first prove an auxiliary lemma.
Suppose is the maximum eigenvalue of and are matrices in section 3.1. are the corresponding left and right eigenvectors with
Proof of Theorem 3.2.
Suppose is the weight matrix of , then
By the above lemma, we have that
Then are the left and right eigenvectors of corresponding to the eigenvalue . And
- [1] Glasgow, R. E. et al., Does the chronic care model serve also as a template for improving prevention?, Milbank Q., 2001, 79(4):579-612.
- [2] Hidalgo, C. A., Blumm, N., Barabasi, A.-L., Christakis, N. A., A Dynamic Network Approach for the Study of Human Phenotypes, PLoS Computational Biology, 2009, 5, e1000353.
- [3] Tedesco, M. A., Natale, F., Di Salvo, G., Caputo, S., Capasso, M., Calabro, R., Effects of coexisting hypertension and type II diabetes mellitus on arterial stiffness, Journal of Human Hypertension, 2004, 18:469-473.
- [4] Roque, F. S. et al., Using electronic patient records to discover disease correlations and stratify patient cohorts, PLoS Computational Biology, 2011, 7, e1002141.
- [5] Melton, G. B. et al., Inter-patient distance metrics using SNOMED CT defining relationships, J. Biomed. Inform., 2006, 39:697-705.
- [6] Shannon, C. E., A Mathematical Theory of Communication, Bell System Technical Journal, 1948, 27:379-423.
- [7] Maurer, A., Ockham's Razor and Chatton's Anti-Razor, Mediaeval Studies, 1984, 46:463-475.
- [8] Perlis, R. H. et al., Using electronic medical records to enable large-scale studies in psychiatry: treatment resistant depression as a model, Psychol. Med., 2012, 42:41-50.
- [9] Booth, T. E., Power Iteration Method for the Several Largest Eigenvalues and Eigenfunctions, Nuclear Science and Engineering, 2006, 154(1):48-62.
- [10] Kitchens, B., Symbolic dynamics: one-sided, two-sided, and countable state Markov shifts, Springer, 1998.
- [11] Meyer, C. D., Matrix analysis and applied linear algebra, SIAM, 2001.