
Disease Prediction with a Maximum Entropy Method

by Michael Shub, et al.

In this paper, we propose a maximum entropy method for predicting disease risks. It is based on a patient's medical history with diseases coded in ICD-10 which can be used in various cases. The complete algorithm with strict mathematical derivation is given. We also present experimental results on a medical dataset, demonstrating that our method performs well in predicting future disease risks and achieves an accuracy rate twice that of the traditional method. We also perform a comorbidity analysis to reveal the intrinsic relation of diseases.



1 Introduction

Disease prediction is an effective way to assess a person's health status. Studies have shown that, in many cases, identifiable indicators or preventable risk factors are present before the onset of a disease. Acting on these early warnings can effectively reduce the individual's risk of disease. In theory, this reduces the number of treatments needed and makes the necessary interventions more effective. However, the combination of risk factors associated with different diseases and a patient's past medical history is so complicated that no doctor can fully take all of it into account. Currently, doctors use family and health history and physical examinations to estimate a patient's risk and to guide laboratory tests that further evaluate the patient's health. However, these sporadic and qualitative "risk assessments" usually cover only a few diseases and depend on the experience, memory and time of the particular doctor. As a result, current medical care is reactive: it intervenes once the symptoms of a disease appear, rather than proactively treating or eliminating the disease as early as possible.

Today the prevailing model of prospective health care is firmly based on the genome revolution. Indeed, technologies ranging from linkage equilibrium and candidate gene association studies to genome-wide associations have provided an extensive list of disease-gene associations, offering us detailed information on mutations, SNPs, and the associated likelihood of developing specific disease phenotypes.

The basic assumption behind the research is that once we have classified all disease-related mutations, we can use various molecular biomarkers to predict each individual’s susceptibility to future diseases, thus bringing us into a predictive medicine era. However, these rapid advances have also revealed the limitations of genome-based methods. Considering that the signals provided by most disease-related SNPs or mutations are very weak, it is becoming increasingly clear that the prospect of genome-based methods may not be realized soon.

Does this mean that prospective disease prediction must wait until genomic methods are sufficiently mature? Our aim is to show that a method based on medical history offers a viable route to prospective disease prediction.

In this paper, we mainly study the disease prediction and comorbidity of diseases. Our approach is distinctly different in that we are trying to build a general predictive system which can utilize a less constrained feature space, i.e. taking into account all available demographics and previous medical history. Moreover, we rely primarily on ICD-10-CM (International Classification of Diseases, Tenth Revision, Clinical Modification) codes (see Section 2) for making predictions to account for the previous medical history, rather than specialized test results.

2 Data

2.1 Source Data and Population

Our database comprises the medical records of 354,552 patients in China, with a total of 2,904,257 hospital visits. The data were originally compiled from insurance claims from 2007 to 2017. Such medical records are highly complete and accurate, and they are frequently used for epidemiological and demographic research.

The input for our methods consists of each patient's personal information, such as gender, date of birth, treatment date, and diagnosis history, provided per visit. Each data record corresponds to one hospital visit, represented by a patient ID and a diagnosis code, as defined by the International Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM). The International Statistical Classification of Diseases and Related Health Problems (ICD), published by the World Health Organization, provides codes to classify diseases and a wide variety of signs, symptoms, abnormal findings, social circumstances, and external causes of injury or disease.

Each disease or health condition is given a unique code of up to 6 characters, such as A01.001. The first character is a letter and the others are digits. ICD-10 codes are hierarchical, so 6-character codes can be collapsed to fewer characters that identify a small family of related medical conditions. For instance, A01.001 is a specific code for typhoid fever, and it can be collapsed to A01.

Moreover, we group diseases of the same category into one class. For example, A90 is the code for dengue fever (classical dengue) and A91 is the code for dengue hemorrhagic fever; we assign both to the same class, named F_A90. In this way, the roughly 20,000 original ICD-10 codes are grouped into 429 classes.
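The collapsing and grouping steps can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names and the `FAMILY_OF` table are hypothetical, and only the dengue example from the text is filled in, since the full 429-class mapping is not given here.

```python
# Hypothetical grouping table: maps a 3-character category to the
# representative category of its class. Only the dengue example from the
# text is included; the full 429-class table is not reproduced.
FAMILY_OF = {"A91": "A90"}


def collapse_icd10(code, length=3):
    """Collapse a full ICD-10 code (e.g. "A01.001") to its leading category."""
    # Remove the dot so "A01.001" and "A01001" are handled the same way.
    compact = code.replace(".", "").strip().upper()
    return compact[:length]


def to_class(code):
    """Return the F-code class label (e.g. "F_A90") for an ICD-10 code."""
    category = collapse_icd10(code)
    return "F_" + FAMILY_OF.get(category, category)


print(collapse_icd10("A01.001"))
print(to_class("A91"))
```

Categories that have no entry in the grouping table simply map to themselves, so the sketch degrades to plain 3-character collapsing when no class information is available.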

A sample patient medical history is shown in Table 1. Each line represents one hospital visit. Demographic data are also available.

patient_id gender treatment_date code
14532 F 2011-10-15 F_M47
14532 F 2011-11-19 F_N91
14532 F 2012-10-09 F_L20
14532 F 2012-10-19 F_N60
14532 F 2013-05-08 F_B37
14532 F 2013-06-04 F_H10
14532 F 2013-06-15 F_K04
14532 F 2013-08-23 F_L20
Table 1: Medical History Sample

In our medical database, the number of visits per patient ranges from 1 to 491, with a median of 4 and a mean of 8.19. Table 2 shows the 20 most prevalent diseases in our database.

Disease Prevalence
Acute upper respiratory infection 20.88%
Hypertension and its complications 7.35%
Dermatitis and pruritus 3.75%
Gastritis and duodenitis 3.49%
Chronic bronchitis 3.28%
Pulp, gum, and alveolar ridge diseases 3.15%
Hard tissue disease of teeth 2.73%
Abnormal uterine and vaginal bleeding 2.42%
Chronic rhinitis, nasopharyngitis and pharyngitis 1.99%
Non-infectious gastroenteritis and colitis 1.97%
Chronic ischemic heart disease 1.93%
Inflammation of the vagina and vulva 1.72%
Pneumonia 1.63%
Abnormal thyroid (parathyroid) function 1.62%
Other diabetes 1.56%
Backache 1.54%
Acute lower respiratory infection 1.51%
Cervical disc disease 1.44%
Type II diabetes 1.37%
Female pelvic inflammatory disease 1.18%
Table 2: 20 Most Prevalent Diseases

2.2 Quantifying the Strength of Comorbidity Relationships

To measure the correlation arising from disease comorbidity, we need to quantify its strength by introducing a notion of distance between two diseases. One difficulty is that different statistical measures have biases that overestimate or underestimate the relationships involving rare or highly prevalent diseases. These biases matter because the number of diagnoses (prevalence) of a disease follows a long-tailed distribution: most diseases are rarely diagnosed, while a small number of diseases are diagnosed in a large part of the population.

Therefore, quantifying comorbidity usually requires us to compare diseases that affect dozens of patients with diseases that affect millions of patients.

We will use two comorbidity measures to quantify the distance between two diseases: the Absolute Logarithmic Relative Risk (ALRR) and the $\phi$-correlation ($\phi$).

The empirical Relative Risk of observing a pair of diseases $i$ and $j$ affecting the same patient is given by

$$RR_{ij} = \frac{C_{ij} N}{P_i P_j},$$

where $C_{ij}$ is the number of patients affected by both diseases, $N$ is the total number of patients in the population, and $P_i$ and $P_j$ are the prevalences of diseases $i$ and $j$.

Thus, the Relative Risk is defined as

$$RR_{ij} = \frac{p_{ij}}{p_j},$$

where $p_{ij}$ is the transition probability from disease $i$ to disease $j$ and $p_j$ is the incidence probability of disease $j$. The Absolute Logarithmic Relative Risk is defined as

$$ALRR_{ij} = \left| \log RR_{ij} \right|.$$

The empirical $\phi$-correlation, which is Pearson's correlation for binary variables, can be expressed mathematically as

$$\phi_{ij} = \frac{C_{ij} N - P_i P_j}{\sqrt{P_i P_j (N - P_i)(N - P_j)}}.$$

Therefore, the $\phi$-correlation is defined as

$$\phi_{ij} = \frac{p_i p_{ij} - p_i p_j}{\sqrt{p_i p_j (1 - p_i)(1 - p_j)}}.$$

These two comorbidity measures are not completely independent of each other, as they both increase with the number of patients affected by both diseases, yet each has its intrinsic biases. For example, RR overestimates relationships involving rare diseases and underestimates the comorbidity between highly prevalent illnesses, whereas $\phi$ accurately discriminates comorbidities between pairs of diseases of similar prevalence but underestimates the comorbidity between rare and common diseases.
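To make the two empirical measures concrete, here is a small sketch that computes them from raw counts. It assumes the usual count-based forms, RR = C N / (P_i P_j) and phi = (C N - P_i P_j) / sqrt(P_i P_j (N - P_i)(N - P_j)), where C is the number of patients with both diseases, N the population size, and P_i, P_j the prevalences. The function names and the example counts are hypothetical, and the use of the natural logarithm in `alrr` is an assumption.

```python
import math


def relative_risk(c_ij, n, p_i, p_j):
    """Empirical relative risk RR_ij = C_ij * N / (P_i * P_j)."""
    return c_ij * n / (p_i * p_j)


def phi_correlation(c_ij, n, p_i, p_j):
    """Empirical phi-correlation (Pearson correlation of two binary variables)."""
    numerator = c_ij * n - p_i * p_j
    denominator = math.sqrt(p_i * p_j * (n - p_i) * (n - p_j))
    return numerator / denominator


def alrr(rr):
    """Absolute logarithmic relative risk; the natural log is an assumption."""
    return abs(math.log(rr))


# Hypothetical counts: 1,000 patients, 100 with disease i, 200 with disease j.
# Under independence we expect 100 * 200 / 1000 = 20 patients with both.
rr = relative_risk(20, 1000, 100, 200)
print(rr, alrr(rr), phi_correlation(20, 1000, 100, 200))
```

With exactly the number of co-occurrences expected under independence, the relative risk is 1 and both the ALRR and the phi-correlation vanish, which is the baseline against which comorbidity is read.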

3 Methodology

In this section, we formulate the maximum entropy method that we use to predict disease risk.

3.1 Some notations

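Suppose there are $n$ diseases and $m$ records. Let $d_i$ denote disease $i$ ($1 \le i \le n$). A record is a pair of diseases $(d_i, d_j)$, which means that there is a patient with a diagnosis of disease $d_j$ simultaneously with or after disease $d_i$. Let $r_k$ denote record $k$ ($1 \le k \le m$).

Assume that $r_k = (d_{f(k)}, d_{s(k)})$. Here, $f$ and $s$ are maps from $\{1, \dots, m\}$ to $\{1, \dots, n\}$.

$d_{f(k)}$ is called the first disease and $d_{s(k)}$ is called the second disease in record $r_k$.

In this paper, we assume that $f$ and $s$ are surjective. If $f$ is not surjective, we can remove the diseases with indexes in $\{1, \dots, n\} \setminus f(\{1, \dots, m\})$ from the medical history. Then the surjectivity assumption is satisfied for $f$. The same can be done for $s$.

Define the $n \times m$ matrix $F$ by

$$F_{ik} = 1 \text{ if } f(k) = i, \qquad F_{ik} = 0 \text{ otherwise.}$$

Denote by $S$ the $n \times m$ matrix with

$$S_{ik} = 1 \text{ if } s(k) = i, \qquad S_{ik} = 0 \text{ otherwise.}$$

Let $A = F S^{T}$ and $B = S^{T} F$. Then

$$A_{ij} = \#\{k : f(k) = i,\ s(k) = j\}$$

is the number of patients who suffer from disease $d_i$ before disease $d_j$.

$B$ is an $m \times m$ matrix with entries in $\{0, 1\}$. $B_{kl} = 1$ if and only if there is a disease $d$ such that the patient in record $r_k$ suffers $d$ as the second disease and the patient in record $r_l$ suffers $d$ as the first disease. Our task is to evaluate the transition probability from record $r_k$ to record $r_l$.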

3.2 Entropy for Markov Chains

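Suppose $M = (m_{ij})$ is a non-negative square matrix. If for each pair $(i, j)$ there exists $k \ge 1$ such that $(M^k)_{ij} > 0$, then $M$ is said to be irreducible.

Now we define the entropy for Markov chains. A matrix $M$ is called a skeleton matrix if its entries are either $0$ or $1$. A non-negative matrix $P = (p_{ij})$ is called a Markov transition matrix if

$$\sum_j p_{ij} = 1 \quad \text{for all } i.$$

Moreover, if $p_{ij} = 0$ whenever $m_{ij} = 0$, then $P$ is called the Markov transition matrix associated with the skeleton matrix $M$.

For a non-negative vector $\pi = (\pi_1, \dots, \pi_n)$, if

$$\sum_i \pi_i = 1 \quad \text{and} \quad \pi P = \pi,$$

then $\pi$ is called a stationary distribution of $P$.

For a non-negative matrix $W = (w_{ij})$, if

$$\sum_{i,j} w_{ij} = 1$$

and for all $i$,

$$\sum_j w_{ij} = \sum_j w_{ji},$$

then $W$ is called a Markov weight matrix.

Here are some connections between the Markov transition matrix and the Markov weight matrix.

For a Markov transition matrix $P$ with stationary distribution $\pi$, define

$$w_{ij} = \pi_i p_{ij}.$$

Then it is easy to verify that $W$ is a Markov weight matrix.

Conversely, given a Markov weight matrix $W$, set

$$\pi_i = \sum_j w_{ij}, \qquad p_{ij} = \frac{w_{ij}}{\pi_i}.$$

Then $W$ is the Markov weight matrix associated with $P$.

Now we define the entropy for a Markov transition matrix. First, let us consider the chain with length $\ell$.

$$H_\ell(P) = -\sum_{i_0, \dots, i_\ell} \pi_{i_0} p_{i_0 i_1} \cdots p_{i_{\ell-1} i_\ell} \log\left(\pi_{i_0} p_{i_0 i_1} \cdots p_{i_{\ell-1} i_\ell}\right) = -\sum_i \pi_i \log \pi_i - \ell \sum_{i,j} \pi_i p_{ij} \log p_{ij}.$$

So the entropy for a chain with unit length is defined as

$$h(P) = -\sum_{i,j} \pi_i p_{ij} \log p_{ij}.$$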
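The definitions in this subsection can be checked numerically with a short sketch. It assumes that the entropy per unit length is the standard entropy rate, h(P) = -sum_{i,j} pi_i p_ij log p_ij, and that the weight matrix is w_ij = pi_i p_ij. The function names and the example matrix are hypothetical, and the plain iteration used for the stationary distribution assumes an aperiodic chain.

```python
import math


def stationary_distribution(P, iters=1000):
    """Approximate pi with pi P = pi by repeated multiplication (aperiodic chains)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def weight_matrix(P, pi):
    """Markov weight matrix w_ij = pi_i * p_ij."""
    return [[pi[i] * p for p in row] for i, row in enumerate(P)]


def entropy_rate(P):
    """Entropy per step: -sum_{i,j} pi_i p_ij log p_ij, with 0 log 0 = 0."""
    pi = stationary_distribution(P)
    return -sum(pi[i] * p * math.log(p)
                for i, row in enumerate(P) for p in row if p > 0)


P = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical 2-state transition matrix
pi = stationary_distribution(P)
W = weight_matrix(P, pi)
print(pi, W, entropy_rate(P))
```

For this example the entries of `W` sum to one and each row sum equals the matching column sum, which are the two defining properties of a Markov weight matrix.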

3.3 Maximum Entropy Theorem

The principle of maximum entropy is a basic principle in information theory (see e.g. [6]). It states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy. Since the distribution with the maximum entropy is the one that makes the fewest assumptions about the true distribution of data, the principle of maximum entropy can be seen as an application of Occam's razor (see e.g. [7]).

Theorem 3.1.

Suppose the skeleton matrix $M = (m_{ij})$ is irreducible. $\lambda$ is the maximum eigenvalue of $M$, and $u$, $v$ are the corresponding left and right eigenvectors with

$$\sum_i u_i v_i = 1.$$

Then the entropy of the Markov chain associated with the skeleton matrix $M$ attains the maximum when

$$w_{ij} = \frac{u_i m_{ij} v_j}{\lambda}.$$

Here, $W = (w_{ij})$ is the weight matrix for $P$.

Proof. See Appendix.

Theorem 3.2.

Suppose is irreducible. is the maximum eigenvalue of , and are the corresponding left and right eigenvectors with

Then the entropy of the Markov chain associated with the skeleton matrix attains the maximum when

Here, is the weight matrix for .


Proof. See Appendix.

Remark 3.1.

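Recall that we have assumed that $f$ and $s$ are surjective. Therefore, the $n \times m$ incidence matrices $F$ and $S$ (with $F_{ik} = 1$ if and only if $f(k) = i$, and $S_{ik} = 1$ if and only if $s(k) = i$) have rank $n$. Since $A = F S^{T}$ and $B = S^{T} F$, we have that

$$\det(t I_m - B) = t^{m-n} \det(t I_n - A).$$

Therefore, the eigenvalues of $A$ and $B$ are the same except for the zeros. In particular, the largest eigenvalues of $A$ and $B$ are the same.

Moreover, $B S^{T} = S^{T} A$. Suppose $v$ is an eigenvector of $A$ with eigenvalue $\lambda$, then

$$B S^{T} v = S^{T} A v = \lambda S^{T} v.$$

Since $v \neq 0$ and $S^{T}$ is injective as a map from $\mathbb{R}^n$ to $\mathbb{R}^m$, $S^{T} v \neq 0$. Hence, $S^{T} v$ is an eigenvector of $B$ with eigenvalue $\lambda$.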

3.4 Algorithm for Probability Estimation

The algorithm for estimating the transition probabilities is as follows.

Step 1. Compute the matrix $A$.
Step 2. Use the power method to compute the maximum eigenvalue $\lambda$ of $A$ with the corresponding left and right eigenvectors $u$ and $v$. Let $\sum_i u_i v_i = 1$.
Step 3. Compute the weight matrix $W$ as follows: $w_{ij} = u_i A_{ij} v_j / \lambda$.
Step 4. Compute the transition probability $P$ as follows: $p_{ij} = w_{ij} / \sum_k w_{ik}$.
Step 5. Compute the stationary distribution $\pi$ as follows: $\pi_i = \sum_j w_{ij}$.
Algorithm 1 Probability Estimation
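The five steps can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the weight matrix takes the standard maximum-entropy (Parry measure) form w_ij = u_i A_ij v_j / lambda with the normalization sum_i u_i v_i = 1, that the transition matrix and stationary distribution are read off from W by row normalization and row sums, and that the matrix is primitive so that the basic power iteration converges. The function names and the 2 x 2 example matrix are hypothetical.

```python
def power_method(M, iters=2000):
    """Dominant eigenvalue and right eigenvector of a primitive non-negative matrix."""
    n = len(M)
    v = [1.0 / n] * n
    lam = 0.0
    for _ in range(iters):
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(w)              # eigenvalue estimate, since v sums to 1
        v = [x / lam for x in w]  # renormalize so the entries sum to 1
    return lam, v


def max_entropy_chain(A):
    """Return (lam, W, P, pi) following the five steps listed above."""
    n = len(A)
    lam, v = power_method(A)                         # right eigenvector: A v = lam v
    At = [[A[j][i] for j in range(n)] for i in range(n)]
    _, u = power_method(At)                          # left eigenvector: u A = lam u
    scale = sum(u[i] * v[i] for i in range(n))
    u = [x / scale for x in u]                       # enforce sum_i u_i v_i = 1
    W = [[u[i] * A[i][j] * v[j] / lam for j in range(n)] for i in range(n)]
    pi = [sum(row) for row in W]                     # stationary distribution
    P = [[W[i][j] / pi[i] for j in range(n)] for i in range(n)]
    return lam, W, P, pi


A = [[1, 1], [1, 0]]  # hypothetical count matrix for two diseases
lam, W, P, pi = max_entropy_chain(A)
print(lam, P, pi)
```

For larger matrices a library eigen-solver would be the more practical choice; the hand-written iteration is kept here only so that each step stays visible.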

3.5 Method for Disease Prediction

The prediction task is to predict the diseases that a person is most likely to have, given the diseases that he or she has already suffered from, ordered by occurrence.

We first calculate the transition probability by Algorithm 1. Then we construct the following quantity

Then we sort these quantities and choose the top 5 diseases as the predicted diseases for a person.

We also make an additional assumption, namely that the most recent disease takes the highest weight, and some modifications are made accordingly. We first construct a decreasing sequence of weights. Then we modify the quantity as

And we use the modified quantity to choose the top 5 diseases as the predicted diseases for a person.
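One way to realize this ranking step is sketched below. The scoring rule is an assumption made for illustration: each candidate disease j receives the weighted sum of the transition probabilities p_ij over the diseases i in the history, with geometrically decreasing weights so that the most recent disease counts most. The function name, the `decay` parameter and the toy matrix are hypothetical.

```python
def predict(history, P, index, top_k=5, decay=0.5):
    """Rank candidate diseases by a recency-weighted sum of transition probabilities.

    history: disease labels ordered by occurrence (oldest first)
    P:       transition matrix as a list of rows
    index:   dict mapping disease label -> row/column position in P
    decay:   hypothetical weight ratio between consecutive history entries
    """
    names = sorted(index, key=index.get)
    scores = [0.0] * len(names)
    weight = 1.0
    for disease in reversed(history):      # most recent disease first
        row = P[index[disease]]
        for j, p in enumerate(row):
            scores[j] += weight * p
        weight *= decay                     # older diseases count less
    ranked = sorted(range(len(names)), key=lambda j: scores[j], reverse=True)
    return [names[j] for j in ranked[:top_k]]


# Toy example with three hypothetical diseases.
index = {"a": 0, "b": 1, "c": 2}
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
print(predict(["a", "b"], P, index, top_k=2))
```

Setting `decay=1.0` recovers an unweighted sum, which corresponds to the variant without the recency assumption.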

4 Experiments

4.1 Data Cleaning

The diseases are classified by F-code as described in Section 2, and there are 429 F-coded diseases in total. If a patient suffered from the same disease multiple times, we keep the earliest record and remove the others. For example, the patient with patient_id = 123770 suffered from mucopurulent conjunctivitis on 2015-12-09 and 2016-03-08, so the record dated 2016-03-08 is removed from the history.

We clean the data and merge the records of each patient into one record. The history column records the patient's disease history, with the diseases sorted by time and separated by commas.
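The cleaning step can be sketched as follows, assuming visits are available as (patient_id, treatment_date, code) tuples with ISO-formatted dates so that string order matches chronological order. The function name and the tuple layout are hypothetical; the example rows are taken from Table 1.

```python
def build_histories(visits):
    """Merge visits into one comma-separated history per patient.

    visits: iterable of (patient_id, treatment_date, code) tuples, with dates
            as ISO strings (YYYY-MM-DD). Only the earliest occurrence of each
            disease is kept.
    """
    histories = {}
    for patient_id, _date, code in sorted(visits, key=lambda v: (v[0], v[1])):
        seen = histories.setdefault(patient_id, [])
        if code not in seen:          # drop repeated diagnoses of the same disease
            seen.append(code)
    return {pid: ",".join(codes) for pid, codes in histories.items()}


visits = [
    ("14532", "2012-10-09", "F_L20"),
    ("14532", "2011-10-15", "F_M47"),
    ("14532", "2013-08-23", "F_L20"),   # repeat of F_L20, removed
    ("14532", "2011-11-19", "F_N91"),
]
print(build_histories(visits))
```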

The following table is a sample of the cleaned data.

patient_id history
123770 F_H10,F_H00
135086 F_M65,F_J00,F_K29,F_K01,F_K04
400195 F_J00,F_K29
3218331 F_J00,F_J40
119151 F_J00,F_N60
102519 F_J00,F_L50,F_E34,F_E01
1503387 F_K29,F_K01,F_I83
7044682 F_J00,F_J20,F_J40
182660 F_E01,F_J00,F_J20
1888934 F_K31,F_K29,F_K22,F_K50,F_J00,F_J40,F_J20,F_M70,F_J30,F_J34
Table 3: Sample of Cleaned Data

4.2 Calculate the Probability

We construct the matrix as follows.

First we initialize a matrix with entries equal to $0$. We also construct a map to index the diseases. Next, for each history, we set

Thus, we establish the matrix .
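A possible construction of the count matrix from the cleaned histories is sketched below. The update rule is an assumption: for every pair of diseases in a history where the first occurs before the second, the corresponding entry is incremented by one, which matches the description of the entries as the number of patients who suffer from one disease before another. The function name is hypothetical.

```python
from itertools import combinations


def build_count_matrix(histories):
    """Build the disease index and the matrix of ordered co-occurrence counts.

    histories: list of disease-label lists, each ordered by time and free of
               repeats (as produced by the cleaning step).
    """
    diseases = sorted({d for h in histories for d in h})
    index = {d: k for k, d in enumerate(diseases)}
    n = len(diseases)
    A = [[0] * n for _ in range(n)]           # initialize all entries to 0
    for h in histories:
        # combinations() preserves order, so `first` always precedes `second`.
        for first, second in combinations(h, 2):
            A[index[first]][index[second]] += 1
    return A, index


histories = [
    ["F_J00", "F_K29"],
    ["F_J00", "F_J40"],
    ["F_J00", "F_J20", "F_J40"],
]
A, index = build_count_matrix(histories)
print(index)
print(A)
```

If only consecutive diagnoses should count as a record, the inner loop could iterate over `zip(h, h[1:])` instead; which of the two readings is intended is not settled by the text.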

Next, we use the power method to calculate the maximum eigenvalue of and the corresponding left and right eigenvectors and .

After that, we derive the Markov weight matrix and the transition probability as described in Algorithm 1.

Finally, we can calculate as described in subsection 3.5 and derive the related disease prediction.

4.3 Results of Accuracy

For comparison, we use as the benchmark a method previously used by the insurance company. This method, which we call the empirical method, calculates the incidence rate of each disease and uses the 5 most prevalent diseases as the prediction for every person.
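The benchmark amounts to a frequency count. A minimal sketch, assuming prevalence is counted once per patient over the cleaned histories (the function name is hypothetical):

```python
from collections import Counter


def top_prevalent(histories, k=5):
    """Return the k diseases that appear in the most patient histories."""
    counts = Counter(d for h in histories for d in set(h))
    return [disease for disease, _ in counts.most_common(k)]


histories = [
    ["F_J00", "F_K29"],
    ["F_J00", "F_J40"],
    ["F_J00", "F_J20", "F_J40"],
]
print(top_prevalent(histories, k=2))
```

The same list is returned for every person, which is what makes this a useful baseline: any history-aware method should beat it.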

We use 300,000 people's records to calculate the transition probabilities and also the top 5 diseases. For another 10,000 people, we use their records from 2007-2014 to compute the score described in the previous section and choose the 5 diseases with the highest score as the prediction; this is the maximum entropy method.

Then we examine the diseases they suffered from during 2015-2017 to see how many diseases are accurately predicted by the two methods.

The measurement we use is called the hit rate. It is defined as follows.

where A is the disease set predicted by the model and B is the set of diseases that a person suffers from during 2015-2017.

If we predict 5 diseases using the maximum entropy method, the hit rate is 31.89%. By contrast, the hit rate is 16.55% for the empirical method.

We also compare the hit rates of the two methods with 1, 2, 3 and 4 predictions. The following table summarizes the results.

number of predictions maximum entropy method empirical method
1 15.01% 7.54%
2 20.50% 10.67%
3 25.21% 13.55%
4 29.01% 14.92%
5 31.89% 16.55%
Table 4: Comparisons of Hit Rate

We can see from the table that the hit rate of the maximum entropy method is approximately twice that of the empirical method.

4.4 Comorbidity Analysis

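We first study the ALRR. Recall that $ALRR_{ij}$ is calculated as follows.

$$ALRR_{ij} = \left| \log \frac{p_{ij}}{p_j} \right|,$$

where $p_{ij}$ is the transition probability from disease $i$ to disease $j$ and $p_j$ is the incidence probability of disease $j$.

If disease $i$ and disease $j$ are independent, then $ALRR_{ij}$ is close to $0$. So if $ALRR_{ij}$ is large, then disease $i$ and disease $j$ are highly correlated.

For example, if disease $i$ is high blood pressure and disease $j$ is type II diabetes, then $ALRR_{ij} = 2.47$ (see Table 5).

Here is a list of disease pairs with high ALRR.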

disease $i$ | disease $j$ | $ALRR_{ij}$
Type II diabetes | Type I diabetes | 3.40
Pulmonary heart disease | Acute ischemic heart disease | 3.35
Type II diabetes | atherosclerosis | 3.18
Diseases of lip, tongue and oral mucosa | Malignant tumors of the lip, mouth and pharynx |
heart failure | Arrhythmia | 2.96
Metabolic disorders | renal failure | 2.84
emphysema | asthma | 2.80
pneumonia | Bronchiectasis | 2.67
high blood pressure | Type II diabetes | 2.47
high blood pressure | renal failure | 2.46
alopecia | Seborrheic keratosis | 2.41
heart failure | Anal and rectal disorders | 2.41
Type II diabetes | high blood pressure | 2.39
heart failure | Peptic ulcer | 2.38
high blood pressure | atherosclerosis | 2.36
Alzheimer disease | Sleep disorders | 2.33
high blood pressure | Cerebral hemorrhage or infarction and its sequelae |
Over nutrition | Other diabetes | 2.27
Pulmonary heart disease | Arrhythmia | 2.22
high blood pressure | heart failure | 2.21
Pituitary hyperfunction | Joint disorder | 2.12
Pulmonary heart disease | arthritis | 2.08
Table 5: ALRR of maximum entropy method

The following table is a list of disease pairs such that $ALRR_{ij}$ differs from $ALRR_{ji}$.

disease $i$ | disease $j$ | $ALRR_{ij}$ | $ALRR_{ji}$
Female pelvic inflammatory disease | Trichomoniasis | 1.13 | 3.40
nephritic nephrotic syndrome | heart failure | 1.51 | 3.35
Metabolic disorders | Malignant tumor of skin | 1.37 | 3.19
Esophageal diseases | Splenic diseases | 3.40 | 1.58
anemia | hypotension | 2.89 | 1.30
Other diabetes | Central nervous system diseases | 1.26 | 2.84
Benign tumor of uterus | Tumors with undetermined or unknown endocrine gland dynamics | 2.89 | 1.31
Arrhythmia | Mental and behavioral disorders | 2.58 | 1.03
Arthrosis | epilepsy | 0.71 | 1.89
Arrhythmia | Diseases of autonomic nervous system | 2.63 | 1.47
Headache syndrome | Other diseases of arteries and arterioles | 1.58 | 2.74
Acute pancreatitis and other diseases of pancreas | Type II diabetes | 2.78 | 1.70
asthma | emphysema | 1.73 | 2.80
Ankylosis and other spondylosis | Hypopituitarism | 1.87 | 0.80
Arthrosis | Myasthenia and primary muscle diseases | 2.71 | 1.65
Malignant tumors of digestive organs | Hemangioma and lymphangioma | 3.40 | 2.33
Refractive and accommodative disorders | glaucoma | 2.41 | 1.35
Other diabetes | Optic neuropathy | 1.71 | 2.74
Chronic ischemic heart disease | Pericardial disease | 1.26 | 2.28
Type II diabetes | Over nutrition | 0.82 | 1.83
Table 6: Asymmetric ALRR of maximum entropy method

Next, we consider the $\phi$-correlation, as defined in Section 2.2.

The next table displays disease pairs with high $\phi$-correlation.

disease $i$ | disease $j$ | $\phi_{ij}$
Mania, bipolar, depression, and anxiety disorders | sleep disorder | 68.23
Type II diabetes | Hypertension and its complications | 67.75
Headache syndrome | Pulp, gums and edentulous alveolar ridge diseases |
Arrhythmia | Hypertension and its complications | 58.55
Muscle disorders | Backache | 57.28
Shingles | Dermatitis and pruritus | 55.30
Headache syndrome | Backache | 53.33
Benign uterine tumor | Abnormal uterine and vaginal bleeding | 46.75
Upper respiratory tract diseases such as chronic laryngitis and laryngotracheitis | Chronic rhinitis, nasopharyngitis and pharyngitis |
Other disorders of kidney and ureter | Other disorders of the urinary system | 42.30
Pulp, gums and edentulous alveolar ridge diseases | Dermatophytes and other superficial fungal diseases |
Other disorders of male reproductive organs | Prostatic hyperplasia and prostatitis | 40.69
Urethral disorders | Other disorders of the urinary system | 39.66
Other disorders of bone | Osteoporosis without pathological fracture | 34.96
Upper respiratory tract diseases such as chronic laryngitis and laryngotracheitis | Chronic bronchitis | 33.39
Other diseases of the digestive system | Gastritis and duodenitis | 29.68
Type II diabetes | Metabolic disorders | 27.51
Type II diabetes | Dermatitis and pruritus | 27.16
Arrhythmia | sleep disorder | 27.11
Table 7: $\phi$-correlation of maximum entropy method

Table 5 shows many disease pairs with a large comorbidity measure, such as type II diabetes and hypertension and its complications, which suggests that such diseases are intrinsically related. Tedesco et al. [3] report that hypertension is frequently associated with diabetes mellitus and that its prevalence in diabetics is double that in the general population. This high prevalence is associated with increased stiffness of large arteries. Our result is consistent with their medical research.

5 Conclusions

In this paper, we propose a maximum entropy method for predicting disease risks. It is based on a patient’s medical history with diseases coded in ICD-10 which can be used in various cases. The complete algorithm with strict mathematical derivation is given. We also present experimental results on a medical dataset, demonstrating that our method performs well in predicting future disease risks and achieves an accuracy rate twice that of the traditional method. We also perform a comorbidity analysis to reveal the intrinsic relation of diseases.


We would like to thank Franco Mueller, Jonathan Brezin and Matt Grayson for their collaboration on an early version of this research.


Proof of Theorem 3.1.

Suppose is the weight matrix. Then the entropy of the Markov chain can be rewritten as

Let us construct the Lagrangian

If , we have that

If , then . Therefore,

Set , then

By the Perron-Frobenius theorem (see e.g. [11]), there are no non-negative eigenvectors other than the Perron vector and its positive multiples. Hence,


On the other hand,

Therefore, there exists such that


Recall that

and . It follows that

Since , ,

Next, we will prove Theorem 3.2. We first prove an auxiliary lemma.

Lemma 1.

Suppose is the maximum eigenvalue of and are matrices in section 3.1. are the corresponding left and right eigenvectors with



Recall that


Here, denote the number of elements in .

Since , we have that

Since , we have that

The element of the left hand side in (1) is

Assume that


The element of the right hand side in (1) is

Thus, we complete the proof. ∎

Proof of Theorem 3.2.

Suppose is the weight matrix of , then

By the above lemma, we have that


Then are the left and right eigenvectors of corresponding to the eigenvalue . And



  • [1] Glasgow R.E. et al, Does the chronic care model serve also as a template for improving prevention, Milbank Q, 2001, 79(4):579-612.
  • [2] Cesar A. Hidalgo, Nicholas Blumm, Albert-Laszlo Barabasi, Nicholas A. Christakis, A Dynamic Network Approach for the Study of Human Phenotypes, PLoS Computational Biology, 2009, 5, e1000353.
  • [3] M. A. Tedesco, F. Natale, G. Di. Salvo, S. Caputo, M. Capasso & R. Calabro, Effects of coexisting hypertension and type II diabetes mellitus on arterial stiffness, Journal of Human Hypertension, 2004, 18:469-473.
  • [4] Roque, F. S. et al., Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Computational Biology, 2011, 7, e1002141.
  • [5] Melton, G. B. et al., Inter‑patient distance metrics using SNOMED CT defining relationships, J. Biomed. Inform., 2006, 39:697-705.
  • [6] Shannon, C.E., A Mathematical Theory of Communication, Bell System Technical Journal, 1948, 27:379-423.
  • [7] Maurer, A., Ockham’s Razor and Chatton’s Anti-Razor, Mediaeval Studies, 1948, 46:463-475.
  • [8] Perlis, R. H. et al. Using electronic medical records to enable large‑scale studies in psychiatry: treatment resistant depression as a model Psychol. Med., 2012, 42:41-50.
  • [9] Thomas E. Booth, Power Iteration Method for the Several Largest Eigenvalues and Eigenfunctions, Nuclear Science and Engineering, 2006, 154(1):48-62.
  • [10] Bruce Kitchens, Symbolic dynamics, one-sided, two-sided, and countable state Markov shifts, 1998, Springer.
  • [11] Carl D. Meyer, Matrix analysis and applied linear algebra. With solutions to problems, 2001, SIAM.