
Distributed representation of patients and its use for medical cost prediction

Efficient representation of patients is very important in the healthcare domain and can help with many tasks such as medical risk prediction. Many existing methods, such as Diagnostic Cost Groups (DCG), rely on expert knowledge to build patient representation from medical data, which is resource consuming and non-scalable. Unsupervised machine learning algorithms are a good choice for automating the representation learning process. However, there is very little research focusing on patient-level representation learning directly from medical claims. In this paper, we proposed a novel patient vector learning architecture that learns high-quality, fixed-length patient representation from claims data. We conducted several experiments to test the quality of our learned representation, and the empirical results show that our learned patient vectors are superior to vectors learned through other methods including a popular commercial model. Lastly, we provide potential clinical interpretation for using our representation on predictive tasks, as interpretability is vital in the healthcare domain.






Efficient representation of patients is very important in the healthcare domain and can help with many tasks such as medical risk prediction. Many existing methods, such as Diagnostic Cost Groups (DCG), rely on expert knowledge to build patient representation from medical data, which is resource consuming and non-scalable. Unsupervised machine learning algorithms are a good choice for automating the representation learning process. However, there is very little research focusing on patient-level representation learning directly. In this paper, we proposed a novel patient vector learning architecture that learns high-quality, fixed-length patient representation from claims data. In addition, our model can learn meaningful medical visit representation and medical code representation at the same time. We conducted several experiments to test the quality of our learned representation, and the empirical results show that our learned patient vectors are superior to vectors learned through other methods. We also used our patient vector on a real-world application, and it outperforms a popular commercial model. Lastly, we provide potential clinical interpretation for using our representation on predictive tasks, as interpretability is vital in the healthcare domain. (Footnote 1: The code for our model is available on GitHub.)


With the increasing adoption of electronic healthcare records (EHR), more healthcare and medical data are available digitally. The large amount of health data offers opportunities to apply machine learning methods in many predictive healthcare tasks. Some algorithms, such as logistic regression, typically prefer the input feature vectors to be small and efficient. One basic way to represent the medical coding data is through a bag-of-words (BOW) like approach: each medical code can be represented as a one-hot vector and each patient can be represented by aggregating the one-hot medical code vectors. However, representing the patients in such a high-dimensional and sparse vector will not only lose the temporal and co-occurrent information between medical visits and within medical codes, but also make it difficult for the machine learning algorithms to make a stable prediction without overfitting for different medical outcomes.
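The bag-of-words style representation described above can be illustrated with a minimal sketch. The toy vocabulary, the `one_hot` and `patient_count_vector` helper names, and the example visits are assumptions made for illustration; they are not taken from the study's code.

```python
from collections import Counter

def one_hot(code, vocab_index):
    """Return the one-hot vector of a single medical code."""
    vec = [0] * len(vocab_index)
    vec[vocab_index[code]] = 1
    return vec

def patient_count_vector(visits, vocab_index):
    """Aggregate all codes of a patient into one count vector.

    This equals the element-wise sum of the one-hot vectors of every code
    occurrence, so visit order and within-visit grouping are discarded.
    """
    counts = Counter(code for visit in visits for code in visit)
    ordered_codes = sorted(vocab_index, key=vocab_index.get)
    return [counts.get(code, 0) for code in ordered_codes]

# Hypothetical toy vocabulary and patient history (two visits).
vocab = ["250.00", "250.01", "493.90", "99213"]
vocab_index = {code: i for i, code in enumerate(vocab)}
visits = [["250.00", "99213"], ["250.00", "493.90"]]

bow = patient_count_vector(visits, vocab_index)
summed_one_hot = [
    sum(col)
    for col in zip(*(one_hot(c, vocab_index) for v in visits for c in v))
]
```

With a real vocabulary of more than 13,000 codes, each such vector is mostly zeros, which is the sparsity problem the paragraph above refers to.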

Medical concepts, especially medical codes, are not independent from each other. For example, Type 2 diabetes mellitus (ICD-9 250.00) is obviously more related to Type 1 diabetes mellitus (ICD-9 250.01) than Asthma, unspecified (ICD-9 439.00). Representing a medical code in a one-hot vector will not capture such relationships. To address this representation issue, medical experts developed many models that can group related medical codes for different purposes. For example, Clinical Classifications Software (CCS) is a tool developed for grouping related diagnosis and procedure codes into a manageable number (around 600) of clinically meaningful categories [1, 2, 3]. Diagnostic Cost Groups (DCG) group the diagnosis codes from medical risk perspectives, and it is able to calculate patients’ expected future medical costs based on diagnoses, age and gender [4, 5, 6, 7, 8]. DCG has been implemented by commercial vendors like Truven Health Analytics (Ann Arbor, Michigan). Although models based on domain knowledge provide us with a good way to capture the relationships between medical concepts, they are very resource consuming and labor-intensive to develop, maintain and update.

To overcome this limitation, many researchers utilized machine learning methods to learn efficient representations of medical concepts without relying on domain knowledge. One possible way is through supervised learning, and the representation will be learned as a “side effect.” Choi et al. built a recurrent neural network (RNN) model to predict patients’ future diagnoses. Baytas and colleagues [10] built a time-aware RNN model to handle time irregularities in medical visit sequences. Choi et al. [11] used an interpretable attention based multi-layer perceptron (MLP) model to predict different medical outcomes. Ma et al. [12] developed a bidirectional attention RNN model to predict future medical codes. All these models are supervised, and efficient representations of patients are learned automatically to achieve a good predictive outcome. However, the quality of the representation is not the major focus for supervised learning tasks, and the representation learning process might be biased for certain kinds of predictive tasks.

Another common way to learn high quality representation is unsupervised learning. Choi et al. [13] and Choi et al. [14] developed models based on skip-gram to learn medical codes representation. Miotto et al. [15] and Baytas et al. [10] learned patient representation via auto-encoder (AE) based models. Skip-gram [18] based models are able to capture the co-occurrence information within medical visits. However, the existing skip-gram based models are only able to learn representation for medical codes and medical visits. In order to obtain patient representation, one needs to aggregate the code/visit vectors by averaging or adding, which will lose the temporal order of information in medical visits. RNN-AE based models [10] are able to capture the temporal relationship of the medical visits. However, RNN based models are notoriously hard to train when the length of medical sequences is long. In addition, results are not interpretable.
To address the aforementioned representation learning challenges in healthcare, we proposed Patient Vector (PV), an unsupervised framework that learns continuous distributed vector representation for patients. Our model is inspired by the recent work in learning paragraph vectors from English documents [16] and learning medical visit vectors from EHR [13]. Our main contributions are listed below:

  1. We proposed a novel architecture to learn patient representation that can be used for many predictive tasks including medical cost prediction. Moreover, our model is able to learn a meaningful representation for medical visits and medical codes.

  2. We demonstrated our learned patient representation improves predictive model performance in various healthcare tasks compared to other baselines. Additionally, our model outperforms another popular commercial model in practical use.

  3. We showed the potential interpretability of our learned representation by conducting several case studies and analyzing the clinical meaning of our representation.

Materials and methods


Our dataset contains medical claims from 2014 to 2016 for over 300,000 members from Partners for Kids (PFK), one of the largest Accountable Care Organizations (ACO’s) for low-income children in central and southeastern Ohio. In accordance with the Common Rule (45 CFR 46.102[f]) and the policies of Nationwide Children’s Institutional Review Board, this study used a limited data set and was not considered human subjects research and thus not subject to institutional review board approval.
We extracted members’ medical information from their medical claims including: 1. Visit level information, where each visit contains the claims with the same service date, and it includes medical codes (diagnosis, procedure, medication) and utilization (cost, place of service, category of the visit). 2. Individual level information (age, sex, annual cost). Note that the “cost” we refer to here and in the rest of the paper denotes the actual amount that is paid to the provider.
We removed members without continuous eligibility across 2014 to 2016 and members who had fewer than two medical visits during 2014 to 2015. This reduced the size of our cohort from over 300,000 to 107,060 members. We hereby refer to this group of members as patients. We acknowledge that this eligibility requirement creates a biased sample, as it removes those members that might be experiencing more instability (inferred from their inconsistent eligibility within these 3 years). However, without continuous eligibility, we cannot ascertain if the annual medical cost is accurate. For the purpose of evaluating the models, we have chosen to focus on this particular sample.
The detailed statistical information for our cohort is presented in Table 1. We trained our model on 2014-2015 data and evaluated the trained embedding on 2016 data. The details of the evaluation metric will be provided in the following sections.

| Statistic | Value |
| --- | --- |
| # of members | 107,060 |
| # of visits | 1,547,471 |
| Avg. # of visits per member | 14.5 |
| Avg. # of codes per member | 37.7 |
| # of unique medical codes | 13,620 |
| # of unique diagnosis codes | 8,164 |
| # of unique medication codes | 339 |
| # of unique procedure codes | 5,117 |
| Max # of codes per member | 347 |
| (95%, 99%) # of codes per member | (81, 118) |
| (95%, 99%) # of codes per visit | (10, 15) |

Table 1: Statistics of our dataset.

Data preprocessing

Due to the ICD-9 to ICD-10 conversion in October 2015, there were inconsistent diagnosis codes in our dataset. To address the mapping inconsistency, we converted the ICD-10 codes to ICD-9 codes using the publicly available General Equivalence Mappings (GEMs) table [17]. We ignored the codes that could not be mapped in GEMs.
Our dataset contains medical visits (claims for medical procedures) and retail pharmacy claims. In order to train all medical codes (diagnosis, procedure, medication) in the same latent space, we needed to combine the pharmacy claims and medical visits. To do so, we used the service date to track each member’s medical visits. Since the service date of a pharmacy claim could be a few days after the corresponding medical visit, we combined a pharmacy claim with a medical visit if the pharmacy claim occurred within two weeks after the medical visit. We ignored pharmacy claims that had no medical visit in the preceding two weeks.
Lastly, as some members have more than one insurer, their medical claims might have been paid by another insurer, resulting in a non-positive paid amount in some instances. This occurred in approximately 4% of the medical claims in our dataset, and we converted the paid amount in such claims to zero.
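The claim-merging and paid-amount rules above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names are hypothetical, and it assumes that a pharmacy claim is attached to the most recent preceding visit, that a same-day claim counts, and that the 14-day boundary is inclusive, none of which the text specifies.

```python
from bisect import bisect_right
from datetime import date, timedelta

WINDOW = timedelta(days=14)  # "within two weeks after a medical visit"

def attach_pharmacy_claims(visit_dates, pharmacy_dates):
    """Attach each pharmacy claim to the latest visit within WINDOW before it.

    Claims with no visit in the preceding two weeks are dropped.
    """
    visit_dates = sorted(visit_dates)
    attached = {d: [] for d in visit_dates}
    for claim_date in pharmacy_dates:
        pos = bisect_right(visit_dates, claim_date)  # visits on or before the claim
        if pos == 0:
            continue  # no earlier visit at all
        visit = visit_dates[pos - 1]
        if claim_date - visit <= WINDOW:
            attached[visit].append(claim_date)
    return attached

def clean_paid_amount(amount):
    """Convert non-positive paid amounts to zero."""
    return max(amount, 0.0)

# Toy service dates for one member.
visits = [date(2015, 1, 5), date(2015, 3, 1)]
claims = [date(2015, 1, 7), date(2015, 2, 10), date(2015, 3, 15), date(2014, 12, 30)]
merged = attach_pharmacy_claims(visits, claims)
```

In this toy case the claims of 2015-02-10 and 2014-12-30 are ignored because no visit falls in the two weeks before them.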

Basic Notation

We denote all the unique medical codes from the dataset as $c_1, c_2, \dots, c_{|C|} \in C$, where $C$ is the vocabulary of all medical codes. Claims data for each patient contains medical visits $V_1, V_2, \dots, V_T$, where each $V_t$ contains a subset of medical codes and the visits are ordered by time-stamp $t$. Initially, we represent each patient by a count vector $p \in \mathbb{N}^{|C|}$, where the $i$-th entry is $n$ if the patient's record contains code $c_i$ $n$ times across all medical visits. Each visit $V_t$ is represented by a binary vector $x_t \in \{0,1\}^{|C|}$, where column $i$ is 1 if $V_t$ contains code $c_i$. Our dataset contains three types of medical codes: diagnosis, encoded by International Classification of Diseases (ICD); procedure, encoded by Current Procedural Terminology (CPT); and medication, encoded by drug classes.

Learning from code-level co-occurrence information

Medical codes within each medical visit contain co-occurrence information: related medical codes are likely to share a similar set of codes as context [13, 14]. One common way to capture such relationships is via Skip-gram, a model proposed by Mikolov et al. [18] that is able to capture the syntactic and semantic relationships of English words. The main idea of Skip-gram is to use a word to predict its neighboring words in a sentence. By doing so, words sharing similar context will have a similar representation in the latent space. We can also use the Skip-gram model to capture the syntactic and semantic relationships of medical codes, as done by Choi et al. [14] and Choi et al. [13]. More formally, given a sequence of medical codes $c_1, c_2, \dots, c_n$ within a given medical visit, our objective function is to maximize the average log likelihood:

$$\frac{1}{n} \sum_{t=1}^{n} \sum_{-w \le j \le w,\ j \ne 0} \log p(c_{t+j} \mid c_t),$$

where $w$ is the context window size and the probability is computed using the softmax function:

$$p(c_{t+j} \mid c_t) = \frac{\exp\left(v(c_{t+j})^\top v(c_t)\right)}{\sum_{k=1}^{|C|} \exp\left(v(c_k)^\top v(c_t)\right)},$$

where $v(c)$ is the representation for code $c$. Note that we do not distinguish between the “input” and “output” medical code as suggested by Choi et al. [13] considering the unordered structure of medical codes within the medical visit.
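The code-level objective can be made concrete with a small pure-Python sketch. It assumes a single shared embedding per code (no separate input and output vectors, as noted above); the helper names, the 2-dimensional toy embeddings, and the example visit are hypothetical.

```python
import math

def skipgram_pairs(visit_codes, window):
    """List (target, context) pairs within one visit for a given window size."""
    pairs = []
    for t, target in enumerate(visit_codes):
        lo = max(0, t - window)
        hi = min(len(visit_codes), t + window + 1)
        pairs.extend((target, visit_codes[j]) for j in range(lo, hi) if j != t)
    return pairs

def softmax_prob(context, target, emb):
    """p(context | target): softmax over dot products with every code in the vocabulary."""
    scores = {c: sum(a * b for a, b in zip(vec, emb[target])) for c, vec in emb.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[context] - top) / norm

def avg_log_likelihood(visit_codes, window, emb):
    """Average log likelihood of the context codes, averaged over target positions."""
    pairs = skipgram_pairs(visit_codes, window)
    total = sum(math.log(softmax_prob(ctx, tgt, emb)) for tgt, ctx in pairs)
    return total / len(visit_codes)

# Toy embeddings (illustrative values only).
emb = {"250.00": [1.0, 0.0], "250.01": [0.9, 0.1], "493.90": [0.0, 1.0]}
visit = ["250.00", "250.01", "493.90"]
pairs = skipgram_pairs(visit, window=1)
avg_ll = avg_log_likelihood(visit, 1, emb)
```

Training would adjust the embeddings to increase `avg_ll`; the sketch only evaluates the objective for fixed vectors.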

Learning from visit-level sequential information

In Med2vec [13], the model proposed by Choi et al., visit vectors are formed by summing the code vectors and used to predict the codes in neighbouring visits in order to capture the sequential information between medical visits. Similar to Med2vec, we created the same medical visits prediction task but with a small modification. Inspired by the idea from Paragraph-vector architecture [16], we created an additional patient vector for each patient and asked it to contribute to predicting the medical codes within the neighboring medical visits. The patient vector can be thought of as the generalization of patient’s health condition. It acts as a memory that remembers the patient’s overall medical history, and contributes to the prediction of surrounding visits. The patient vector helped the current visit to remember what else it needs when predicting the surrounding visits. The patient vector is shared across all medical visit prediction tasks for the same patient.
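More formally, as shown in Fig 1, given a patient and his/her medical visits $V_1, V_2, \dots, V_T$, the count patient vector $p$ is converted into an intermediate patient representation $u_p$ and the binary visit vector $x_t$ is converted into an intermediate visit representation $u_t$ via the following equations:

$$u_p = \mathrm{ReLU}(W_c\, p + b_c), \qquad u_t = \mathrm{ReLU}(W_c\, x_t + b_c),$$

where $W_c$, $b_c$ are the code weight matrix and bias. Then we concatenate the patient-level demographic information $d_p$ and visit-level demographic information $d_t$ separately to create the final patient representation $v_p$ and visit representation $v_t$ as follows:

$$v_t = \mathrm{ReLU}(W_v [u_t; d_t] + b_v), \qquad v_p = \mathrm{ReLU}(W_p [u_p; d_p] + b_p),$$

where $W_v$, $b_v$ are the visit weight matrix and bias, and $W_p$, $b_p$ are the patient weight matrix and bias. We use ReLU as the activation function. Lastly, we concatenate the patient representation and visit representation to predict the medical codes of the visits within a fixed context window via a softmax model:

$$\hat{y}_t = \mathrm{softmax}(W_s [v_p; v_t] + b_s),$$

where $W_s$, $b_s$ are the output weight matrix and bias. Our objective is to minimize the following cross entropy function:

$$-\frac{1}{T} \sum_{t=1}^{T} \sum_{-w_v \le i \le w_v,\ i \ne 0} \left[ x_{t+i}^\top \log \hat{y}_t + (1 - x_{t+i})^\top \log(1 - \hat{y}_t) \right],$$

where $w_v$ is the visit context window size.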

Fig 1: Patient Vector model architecture. The training objectives are to: 1) learn medical code representation that is good at predicting neighboring codes within the same visit. 2) learn medical visit representation and patient representation that are good at predicting nearby visits.
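The forward pass summarized in Fig 1 can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the released TensorFlow model: the layer sizes, the random initialization, the parameter dictionary keys, and the demographic vectors `d_p` and `d_t` are all hypothetical.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, x, b):
    """Compute W x + b for a matrix W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

def init_params(n_codes, code_dim, n_pat_demo, n_visit_demo, pat_dim, visit_dim, seed=0):
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W_c": mat(code_dim, n_codes), "b_c": [0.0] * code_dim,
        "W_p": mat(pat_dim, code_dim + n_pat_demo), "b_p": [0.0] * pat_dim,
        "W_v": mat(visit_dim, code_dim + n_visit_demo), "b_v": [0.0] * visit_dim,
        "W_s": mat(n_codes, pat_dim + visit_dim), "b_s": [0.0] * n_codes,
    }

def forward(p, x_t, d_p, d_t, params):
    """Predict a distribution over codes of neighboring visits."""
    u_p = relu(affine(params["W_c"], p, params["b_c"]))          # intermediate patient rep.
    u_t = relu(affine(params["W_c"], x_t, params["b_c"]))        # intermediate visit rep.
    v_p = relu(affine(params["W_p"], u_p + d_p, params["b_p"]))  # list "+" concatenates
    v_t = relu(affine(params["W_v"], u_t + d_t, params["b_v"]))
    return softmax(affine(params["W_s"], v_p + v_t, params["b_s"]))

params = init_params(n_codes=5, code_dim=4, n_pat_demo=2, n_visit_demo=1,
                     pat_dim=3, visit_dim=3)
p = [2, 0, 1, 1, 0]    # patient count vector over 5 codes
x_t = [1, 0, 0, 1, 0]  # binary vector of the current visit
y_hat = forward(p, x_t, d_p=[0.5, 1.0], d_t=[1.0], params=params)
```

The same code weights are applied to both the patient count vector and the visit vector, which is what places patients, visits and codes in a shared latent space.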

Unified training

To capture both code-level co-occurrence information and visit-level sequential information, we combined the two aforementioned objective functions into a single training objective, weighted by a hyper-parameter.
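As a rough illustration, one plausible form of such a combined objective is a weighted sum. The symbols below are assumptions introduced for this sketch: $\mathcal{L}_{\text{visit}}$ stands for the visit-level cross entropy loss, $\mathcal{L}_{\text{code}}$ for the negative of the code-level average log likelihood, and $\alpha$ for the weighting hyper-parameter. The placement of the weight is also an assumption rather than the authors' stated formula.

$$\mathcal{L} = \mathcal{L}_{\text{visit}} + \alpha \, \mathcal{L}_{\text{code}}$$

Under this form, $\alpha = 0$ would train on visit-level prediction only, and larger values of $\alpha$ would put more emphasis on within-visit code co-occurrence.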

From classification to ranking

One drawback of our Patient Vector model is the large softmax weight matrix at the output layer. To alleviate the curse of dimensionality when training the softmax classifier, Med2vec [13] utilized the hierarchical structure of medical codes and trained the classifier to predict the grouped medical codes instead of the exact medical codes. By doing so, the final output space can be significantly reduced from around 20,000 to 2,000 dimensions. However, the output dimension is still very high and the curse of dimensionality problem remains unsolved.
To better alleviate this issue, we proposed a different objective function denoted as score ranking function. We denote the Patient Vector model with score ranking objective function as Patient Vector+, as shown in Fig 2.

Fig 2: Patient Vector+ model architecture. The training objectives are to: 1) learn medical code representation that is good at predicting codes within the same visit, and 2) learn medical visit representation and patient representation that give high scores to codes within nearby visits.

In the Patient Vector+ framework, instead of calculating the probability of all possible medical codes in the neighbor visits, we randomly select negative codes (i.e., codes that are not in the neighbor visits) and train the model to assign low scores for these negative codes, while for positive codes (codes that are in the neighbor visits), the model should assign high scores.
More formally, we replace the softmax cross entropy objective with a score ranking objective defined over the positive codes and the randomly sampled negative codes.
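The idea of scoring positive codes high and sampled negative codes low can be sketched as below. Because the exact score ranking function is not given here, this sketch swaps in the logistic negative-sampling loss known from word2vec as a stand-in; the function names, the dot-product score, and the toy vectors are assumptions.

```python
import math
import random

def sample_negatives(positive_codes, vocab_size, k, rng):
    """Randomly pick up to k code ids that do not appear in the neighbor visits."""
    candidates = [c for c in range(vocab_size) if c not in positive_codes]
    return rng.sample(candidates, min(k, len(candidates)))

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def ranking_loss(h, code_vectors, positive_codes, negative_codes):
    """Low when positive codes score high and negative codes score low.

    h is the concatenated patient and visit representation; the score of a
    code is the dot product between h and that code's output vector.
    """
    def score(code):
        return sum(a * b for a, b in zip(code_vectors[code], h))
    loss = -sum(log_sigmoid(score(c)) for c in positive_codes)
    loss -= sum(log_sigmoid(-score(c)) for c in negative_codes)
    return loss

# Toy example: code 0 aligns with h, code 1 points the opposite way.
h = [1.0, 0.0]
code_vectors = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0]]
good_loss = ranking_loss(h, code_vectors, positive_codes=[0], negative_codes=[1])
bad_loss = ranking_loss(h, code_vectors, positive_codes=[1], negative_codes=[0])
negatives = sample_negatives({0, 2}, vocab_size=6, k=3, rng=random.Random(0))
```

Only the positive codes and a handful of sampled negatives are scored per training example, so the cost no longer grows with the full size of the code vocabulary.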

Implementation detail

All the approaches were implemented using Python and TensorFlow, with the dataset randomly divided into training, validation and testing sets in a 7:1:2 ratio by patient. For the training process, we used a minibatch size of 100 and a fixed visit window size, and considered patient, visit and code embedding sizes of {100, 200} each. We performed 40 training epochs over the training dataset and selected the hyperparameters based on the model performance on the validation dataset. We used early stopping on the validation dataset to prevent overfitting, and report the model performance on the testing dataset.
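A patient-level 7:1:2 split can be sketched as follows. The function name and the seed handling are assumptions; the point of the sketch is that the split is made over patient identifiers, so all visits of one patient end up in the same partition.

```python
import random

def split_by_patient(patient_ids, seed, parts=(7, 1, 2)):
    """Shuffle unique patient ids and cut them into train/validation/test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    total = sum(parts)
    n_train = len(ids) * parts[0] // total
    n_val = len(ids) * parts[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test

# Hypothetical patient identifiers.
train_ids, val_ids, test_ids = split_by_patient(range(100), seed=42)
```

Repeating the call with different seeds gives the repeated random splits over which averaged results and standard deviations can be reported.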

Results and discussion

Patient representation evaluation

To evaluate the quality of our patient representation, we first compared our model with other baselines for two different predictive tasks: current year cost prediction and next year cost prediction. Then, we applied our model to a real-world application: high medical risk patient selection, and compared our model with DCG, a popular commercial model.

  • Task 1: Current year cost prediction. The medical cost of current year is based on patients’ medical utilization information including medical procedures, medical utilization and medication usage. One way to evaluate the learned patient representation is to use it for predicting the current year’s medical cost. In our current cost prediction task, we used the learned patient vector as the input feature and current medical cost (from 2014 to 2015) as the output label to train a regression model.

  • Task 2: Next year cost prediction. The next year medical cost of patients is very important for hospitals and ACO’s. With more accurate prediction of future medical cost, hospitals and ACO’s can do better financial forecast and financial risk management. In our future cost prediction task, we used learned patient vector and previous medical cost (from 2014 to 2015) as input features, and the future medical cost (in 2016) as the output label to train a regression model.

  • Task 3: High medical risk patient selection. With the shift to managed healthcare for patients, ACO’s and providers are responsible for managing the overall health of a patient. Many hospitals and ACO’s have care coordination programs aimed to improve health outcomes and reduce medical costs for patients with high medical risk. The DCG model is a population-based classification and risk adjustment methodology that is widely used by organizations and hospitals to evaluate a patient’s future medical risk. The DCG model assigns a score for each patient with a higher score indicating higher predicted medical costs in the next year. In the high medical risk patient selection task, we compared patients with the highest predicted medical cost by DCG and our model.

To compare our model with other baselines, we used the coefficient of determination ($R^2$) and Root Mean Squared Error (RMSE) as evaluation metrics for Task 1 and Task 2. Note that we calculated both values on the log-scaled medical cost in order to normalize the highly skewed distribution of medical costs, as suggested by Diehr et al. Five different random seeds were selected and used to split the dataset. We reported the average results for each model, with standard deviations in parentheses.

  • Coefficient of determination ($R^2$) is the standard approach for the evaluation of regression models. It represents the proportion of the variance in the output value that is predictable from the input variables. The best possible value for $R^2$ is 1, when the output value can be predicted 100% from the input variables.

  • Root Mean Squared Error (RMSE) is also a commonly used measurement for regression models. It measures the differences between the predicted values and the true values.
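Both metrics can be computed on log-scaled costs with a few lines of Python. The sketch uses $\log(1 + \text{cost})$ so that zero-cost patients are defined; that particular transform is an assumption, since the text only states that costs were log-scaled. The function names and toy costs are illustrative.

```python
import math

def log_cost(cost):
    """Log-scale a non-negative cost; log1p keeps zero costs finite (assumed transform)."""
    return math.log1p(cost)

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy annual costs in dollars.
true_costs = [0.0, 120.0, 950.0, 15000.0]
y_true = [log_cost(c) for c in true_costs]
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)
```

A model that always predicts the mean log cost obtains $R^2 = 0$, which is the natural reference point for the percentages reported in the results tables.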

The following baseline methods are used in comparison.

  • Count vector model: Patients are represented by a count vector, where the element at position $i$ represents the number of occurrences of code $c_i$ in the vocabulary.

  • Stacked denoising auto-encoder (SDA): A three layer SDA model to learn the patient representation as described by Miotto et al. [15].

  • Sum skip-gram vectors (Skip-gram): Patients are represented by aggregating their medical code vectors, which are learned by the skip-gram model as described in previous studies [13, 14].

  • Sum Med2vec visit vectors (Med2vec): Patients are represented by aggregating their visit vectors, which are learned by the Med2vec [13] model.

  • Concatenation of visit and code vectors (Med2vec+): To enhance the previous two baselines, we concatenated the visit and code embeddings to form patient embedding.
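The aggregation-based baselines in the list above can be sketched as below. The helper names and toy vectors are hypothetical, and the sketch assumes that the concatenated baseline joins the summed visit vectors with the summed code vectors, which is one reading of the description.

```python
def sum_vectors(vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(column) for column in zip(*vectors)]

def sum_code_vectors(visits, code_emb):
    """Skip-gram style baseline: add up the vectors of every code occurrence."""
    return sum_vectors([code_emb[code] for visit in visits for code in visit])

def concat_visit_and_code(visit_vectors, visits, code_emb):
    """Concatenation baseline: summed visit vectors followed by summed code vectors."""
    return sum_vectors(visit_vectors) + sum_code_vectors(visits, code_emb)

# Toy embeddings: 2-d code vectors and 3-d visit vectors.
code_emb = {"A": [1.0, 0.0], "B": [0.0, 2.0]}
visits = [["A", "B"], ["A"]]
visit_vectors = [[0.5, 0.5, 0.0], [1.0, 0.0, 1.0]]

skipgram_patient = sum_code_vectors(visits, code_emb)
concat_patient = concat_visit_and_code(visit_vectors, visits, code_emb)
reordered = sum_code_vectors(list(reversed(visits)), code_emb)
```

Because summation is order-invariant, reversing the visit sequence gives the same patient vector, which illustrates the loss of temporal information that motivates a dedicated patient vector.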

Table 2 shows the evaluation results for different patient embedding learning approaches. Our patient vector models generally outperform the other models. In particular, the Patient Vector+ model consistently performed the best out of all the models evaluated. The raw count vector model performed worst on the cost prediction tasks, as the input feature space is too high-dimensional for a regression model to make a meaningful prediction.

| Model | Task 1 $R^2$ (SD) | Task 1 RMSE (SD) | Task 2 $R^2$ (SD) | Task 2 RMSE (SD) |
| --- | --- | --- | --- | --- |
| Raw count vector | 15.66% (0.0857) | 1.086 (0.0711) | 15.03% (0.0231) | 1.765 (0.0605) |
| SDA | 28.47% (0.0022) | 1.071 (0.0066) | 26.09% (0.0037) | 1.650 (0.0134) |
| Skip-gram | 42.36% (0.0052) | 0.900 (0.0054) | 26.44% (0.0034) | 1.645 (0.0122) |
| Med2vec [13] | 58.22% (0.0056) | 0.766 (0.0064) | 28.39% (0.0045) | 1.623 (0.0119) |
| Med2vec+ | 61.11% (0.0090) | 0.739 (0.0061) | 28.66% (0.0041) | 1.620 (0.013) |
| Patient Vector | 57.34% (0.0048) | 0.780 (0.0045) | 28.55% (0.0049) | 1.622 (0.0134) |
| Patient Vector+ | 66.75% (0.0048) | 0.684 (0.0075) | 29.88% (0.0035) | 1.611 (0.0122) |

Table 2: Evaluation of patient embedding on two medical risk prediction tasks. A higher $R^2$ value and a lower RMSE value indicate that the model fits the task better.

Fig 3 shows the empirical results for the high medical risk patient selection task. We selected patients with the top 0.5%, 1%, 5%, 10% and 50% highest predicted medical cost by DCG and our model. Selected patients with higher future medical cost (in 2016) indicate a better fitting model. For the top 0.5% predicted high risk patient group, patients selected by the Patient Vector models had much higher medical costs in the predicted year (2016), indicating that our model was able to more accurately identify the subset of the most expensive patients. These high cost patients are likely to consume substantial medical resources in the future. Prospective identification of these patients will allow hospitals and ACO’s to provide interventions such as care coordination.

Fig 3: High cost patient selection by three different models. A higher value on the y-axis indicates that the model is better at selecting future high cost patients.
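The selection procedure can be sketched as a ranking by predicted cost followed by a look-up of the realized cost. The function name and toy numbers are hypothetical, and using the mean realized cost of the selected group as the comparison value is an assumption about what is plotted.

```python
def mean_cost_of_top_fraction(predicted, actual, fraction):
    """Mean realized cost of the patients with the highest predicted cost.

    predicted and actual are aligned lists (one entry per patient);
    fraction is the share of patients to select, e.g. 0.005 for the top 0.5%.
    """
    n_selected = max(1, int(len(predicted) * fraction))
    ranked = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    selected = ranked[:n_selected]
    return sum(actual[i] for i in selected) / n_selected

# Toy scores from a risk model and the following year's realized costs.
predicted = [5.0, 1.0, 9.0, 3.0]
actual = [50.0, 10.0, 100.0, 20.0]
top_quarter = mean_cost_of_top_fraction(predicted, actual, 0.25)
top_half = mean_cost_of_top_fraction(predicted, actual, 0.5)
```

Running the same function on scores from two different models and comparing the results for each fraction reproduces the kind of comparison described above.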

Visit representation evaluation

To evaluate the quality of our visit representation, we used our learned visit representation for tasks similar to the patient representation evaluation tasks: current visit cost prediction and next visit cost prediction.
Table 3 shows the evaluation results for visit embedding. The Med2vec [13] model performs better than our methods in the current visit cost prediction task and worse than our models in the next visit cost prediction task. These two tasks indicate that our methods learn visit representations that are competitive with the previous state-of-the-art medical visit representation learning method.

| Model | Current visit cost prediction ($R^2$, RMSE) | Next visit cost prediction ($R^2$, RMSE) |
| --- | --- | --- |
| Med2vec [13] | 41.38%, 0.930 | 27.19%, 1.046 |
| Patient Vector | 38.85%, 0.949 | 27.29%, 1.044 |
| Patient Vector+ | 38.81%, 0.955 | 28.91%, 1.033 |

Table 3: Evaluation of visit embedding. A higher $R^2$ value and a lower RMSE value indicate that the model fits the task better.

Code representation evaluation

Evaluation by existing medical grouper. We evaluated the quality of our learned diagnosis code vectors via an existing medical grouper: Clinical Classifications Software (CCS). CCS is a medical grouper that groups diagnosis codes into around 300 different categories based on expert opinion: codes within the same category are believed to be related to each other. We plotted all the diagnosis codes in Fig 4 using t-SNE projection and, for a subset of the CCS categories, highlighted the diagnosis codes that belong to the same CCS category as diamonds of the same color. Diagnosis codes from the remaining categories are shown as dim circles.

Fig 4: Diagnosis codes in t-SNE projected vector space. Diagnosis codes that belong to the same CCS category are close to each other, indicating that the learned grouping of codes is similar to that of medical experts.

As shown in Fig 4, most of the diagnosis codes that belong to the same CCS category are grouped in the same area, such as Asthma (CCS 128) and Ear disorders (CCS 94). However, not all diagnosis codes that belong to the same CCS category are well grouped. For example, some superficial injury (CCS 239) related codes are located far away from the majority of codes in that category. According to our observation, there are two reasons for this: 1) Artifacts caused by the dimensionality reduction algorithm. Some separated codes, such as the superficial injury related codes mentioned above, are actually very close to each other in the original embedding space when cosine similarity is used to measure the distance. It is the t-SNE algorithm that separates them from the majority. 2) Other hidden relationships are captured. One major idea of our embedding learning algorithm is to capture the co-occurrence information within medical visits: codes with similar neighbors are closer to each other. Hence the learned code representation does not necessarily follow the CCS categorization. For example, Essential hypertension (CCS 98) and Diabetes mellitus (CCS 49) belong to different CCS categories, but since they often appear together within the same medical visit along with similar diabetes/hypertension related procedures and medications, their learned embedding vectors will be close to each other according to our algorithm, which also makes sense from the clinical perspective.
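Checking closeness in the original embedding space, rather than in the t-SNE projection, only needs a cosine similarity function. The sketch below is illustrative: the vectors are made-up 3-dimensional examples, not learned embeddings, and the variable names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical code vectors: two related codes and one unrelated code.
injury_code_1 = [0.9, 0.1, 0.0]
injury_code_2 = [1.8, 0.2, 0.0]   # same direction, different length
unrelated_code = [0.0, 0.0, 1.0]

sim_related = cosine_similarity(injury_code_1, injury_code_2)
sim_unrelated = cosine_similarity(injury_code_1, unrelated_code)
```

Cosine similarity ignores vector length, so two codes can be near-identical in direction even when a 2-d projection places them apart.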

Medical groupers such as CCS can be used to evaluate the learned medical code vectors, as they give a good sense of the overall quality of the embeddings. Based on Fig 4, we believe our learned code representation is of high quality.

Interpretability Analysis

Interpretation. Interpretability is very important in the healthcare domain. As in many papers on medical representation learning, we interpret the clinical meaning of each dimension of the learned medical code vectors by selecting the top eight codes that have the largest values for each dimension [13][12]. More formally, to evaluate the clinical meaning of the $k$-th column of the code embedding matrix, we selected medical codes via the following equation:

$$\operatorname{argsort}(W_c[:, k])[1:8],$$

where $W_c[:, k]$ represents the $k$-th column of the code embedding matrix and argsort returns the code indices sorted by value in descending order.
In addition to analysing the coordinates of the code embedding, we are also interested in finding the most influential factors used by the predictive model to make predictions. We analyze the regression model and the clinical meaning of the embeddings to find out which code coordinates play an important role in the prediction of current annual cost. We used the analytical method proposed by Choi et al. [19]. The idea is to find the code coordinates that maximize the output activation, which is computed from the weights of the regression model and the code embedding matrix, with the bias term added by broadcasting.
Using the above strategy, we selected the top two coordinates that have the strongest influence, together with their corresponding eight codes, for the current cost prediction, as shown in Table 4. Coordinate 128 groups medical codes that are related to emergency services and accidents that cause wounds. Coordinate 134 is related to acute diseases such as vomiting and fever, radiology examinations and emergency visits. The medical codes associated with the two coordinates correspond to expensive services for children, and hence it makes sense for our regressor to assign more weight to these two coordinates. This computational result also confirms several strategies currently used to control the cost of pediatric ACO’s: 1) to proactively reduce the usage of emergency rooms, 2) to prevent home accidents, and 3) to control operating room cost.

Coordinate 128 Coordinate 134

Table 4: Medical codes with the maximum value in each coordinate.
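Selecting the codes with the largest value in a given coordinate amounts to a descending sort on one embedding dimension. In the sketch below, the function name, the toy 2-dimensional embeddings and the code labels are hypothetical.

```python
def top_codes_for_coordinate(code_emb, k, n=8):
    """Return the n codes whose embedding has the largest value in coordinate k."""
    ranked = sorted(code_emb, key=lambda code: code_emb[code][k], reverse=True)
    return ranked[:n]

# Toy embedding table: code label -> embedding vector.
code_emb = {
    "c1": [0.1, 0.9],
    "c2": [0.8, 0.2],
    "c3": [0.5, 0.5],
}
top_coord_0 = top_codes_for_coordinate(code_emb, k=0, n=2)
top_coord_1 = top_codes_for_coordinate(code_emb, k=1, n=2)
```

Reading the clinical descriptions of the returned codes side by side is what gives each coordinate its interpretation.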

Patient Similarity. Once we have learned the patient representation, we can calculate the distance between patients. To analyze whether it makes sense to evaluate the patient similarity via patient embedding, we performed two clustering tasks. Firstly we selected 500 patients with three different diseases (asthma, depression, seizures). Since the three diseases have distinct etiology, patients with these diseases are likely to be separated in their clinical representation. Secondly, we selected 1,500 patients that have either the highest medical cost or the lowest medical cost in the subsequent year. We hypothesize that patients with high future medical cost would have a more complex medical condition than patients with low future medical cost, and the difference could be captured by patient embedding.
As shown in Fig 5, there are three visible clusters for patients with different diseases. The clusters are not perfectly separated, likely because patients usually have more than one disease, which complicates their health condition and affects the embedding vectors.

Fig 5: Patient representation plot for patients with three different diseases.

Fig 6 shows two clusters for patients with different future medical cost. Most of the high future cost patients are clustered tightly with each other. In contrast, patients with low future cost are loosely spread on the plot, with some mixing with the high future cost patients. We postulate that this is due to the heterogeneous nature of acute conditions frequently encountered in children. On the other hand, patients with the highest future cost are more likely to have chronic, severe diseases that require long-term extensive medical care.

Fig 6: Patient representation plot for patients with different future medical cost.
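The cohort selection and a two-dimensional view of the embeddings can be sketched as below. This is a hedged illustration: the projection method behind the figures is not stated in this section, so PCA is used here purely as a stand-in, and the function names, cohort sizes, and random data are hypothetical.

```python
import numpy as np

def select_cost_extremes(costs, n_each):
    """Indices of the n_each lowest-cost and n_each highest-cost patients."""
    order = np.argsort(costs)
    return order[:n_each], order[-n_each:]

def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-ins for patient embeddings and next-year costs.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16))
costs = rng.gamma(shape=2.0, scale=1000.0, size=100)

low, high = select_cost_extremes(costs, n_each=10)
selected = np.concatenate([low, high])
coords = pca_2d(embeddings[selected])  # one (x, y) point per selected patient
print(coords.shape)
```

The resulting `coords` can be passed to any scatter-plot routine, colored by whether the patient came from `low` or `high`.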

Limitation and future studies

The model we proposed in this study is data-driven and excludes expert knowledge. This is a limitation under the current scope, as taking domain knowledge into account is likely to improve the predictive power of healthcare models [20, 21, 22]. Another limitation is that we did not consider heterogeneity (i.e., different subgroups might have different relationships between variables, such as diagnosis, and medical cost). "Average effects" might lead to inaccurate predictions [23, 24].
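The concern about average effects can be illustrated with synthetic data. The following sketch is not drawn from the study: the two subgroups, the predictor, and all numbers are hypothetical. It fits one pooled slope and one slope per subgroup, then compares prediction error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, size=2 * n)   # hypothetical predictor (e.g., visit count)
group = np.repeat([0, 1], n)         # two hypothetical subgroups
true_slopes = np.array([100.0, 500.0])  # cost per unit differs by subgroup
y = true_slopes[group] * x + rng.normal(0, 50, size=2 * n)

def fit_slope(x, y):
    """Least-squares slope for a line through the origin."""
    return float(x @ y / (x @ x))

def rmse(pred, y):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# One "average effect" for everyone versus one effect per subgroup.
pooled = fit_slope(x, y)
per_group = np.array([fit_slope(x[group == g], y[group == g]) for g in (0, 1)])

rmse_pooled = rmse(pooled * x, y)
rmse_grouped = rmse(per_group[group] * x, y)
print(f"pooled slope {pooled:.1f}, RMSE {rmse_pooled:.1f}")
print(f"subgroup slopes {per_group.round(1)}, RMSE {rmse_grouped:.1f}")
```

The pooled slope falls between the two subgroup slopes, so it overestimates cost for one subgroup and underestimates it for the other, which is the kind of inaccuracy the cited work warns about.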

Although the only predictive output of our model is the amount paid to the healthcare provider, we understand that it is not the only valuable outcome. Our future work will focus not only on medical cost, but also on the quality and necessity of medical care.


Conclusion

In this paper, we proposed a novel architecture to address the challenges of modeling patient representation from medical data. By utilizing a multilayer neural network, we built an unsupervised learning algorithm that learns patient representations capturing both medical code co-occurrence information and medical visit sequential information. Our learned patient embeddings show superior predictive power when compared with existing patient representation methods and with a commercially used model developed by medical experts. In addition, we demonstrated clinical interpretability by applying the patient embedding to a real-world predictive task.


Acknowledgments

We would like to thank Jennifer Klima and Brad Stamm from Partners For Kids (PFK) for providing the medical claims data for this study and for valuable discussions. We also thank Stephen Cardamone and Deena Chisolm for providing useful feedback on the manuscript.


  •  1. Elixhauser A. Clinical Classifications for Health Policy Research, Version 2: Hospital Inpatient Statistics. 1996. 180 p.
  •  2. Elixhauser A, Steiner CA. Hospital Inpatient Statistics, 1996: Tools for Decisionmaking & Research. 1999. 69 p.
  •  3. Elixhauser A. Clinical Classifications for Health Policy Research: Hospital Inpatient Statistics. 1998. 177 p.
  •  4. Ash AS, Zhao Y, Ellis RP, Schlein Kramer M. Finding future high-cost cases: comparing prior cost versus diagnosis-based methods. Health Serv Res. 2001 Dec;36(6 Pt 2):194–206.
  •  5. Cowen ME, Dusseau DJ, Toth BG, Guisinger C, Zodet MW, Shyr Y. Casemix adjustment of managed care claims data using the clinical classification for health policy research method. Med Care. 1998 Jul;36(7):1108–13.
  •  6. Ellis RP, Ash A. Refinements to the Diagnostic Cost Group (DCG) model. Inquiry. 1995;32(4):418–29.
  •  7. Report F. Diagnostic Cost Group Hierarchical Condition Category Models for Medicare Risk Adjustment. Health Economics Research; 2000.
  •  8. Rosen AK, Loveland SA, Anderson JJ, Hankin CS, Breckenridge JN, Berlowitz DR. Diagnostic cost groups (DCGs) and concurrent utilization among patients with substance abuse disorders. Health Serv Res. 2002 Aug;37(4):1079–103.
  •  9. Choi E, Bahadori MT, Schuetz A, Stewart WF, Sun J. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. JMLR Workshop Conf Proc. 2016 Aug;56:301–18.
  •  10. Baytas IM, Xiao C, Zhang X, Wang F, Jain AK, Zhou J. Patient Subtyping via Time-Aware LSTM Networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17
  •  11. Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. RETAIN: An Interpretable Predictive Model for Healthcare using Reverse Time Attention Mechanism. Advances in Neural Information Processing Systems 29. Curran Associates, Inc.; 2016. p. 3504–12.
  •  12. Ma F, Chitta R, Zhou J, You Q, Sun T, Gao J. Dipole: Diagnosis Prediction in Healthcare via Attention-based Bidirectional Recurrent Neural Networks. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’17. p. 1903–11.
  •  13. Choi E, Bahadori MT, Searles E, Coffey C, Thompson M, Bost J, et al. Multi-layer Representation Learning for Medical Concepts. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16.
  •  14. Choi Y, Chiu CY-I, Sontag D. Learning Low-Dimensional Representations of Medical Concepts. AMIA Jt Summits Transl Sci Proc. 2016 Jul 20;2016:41–50.
  •  15. Miotto R, Li L, Kidd BA, Dudley JT. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Scientific reports 6 (2016): 26094.
  •  16. Le QV, Mikolov T. Distributed Representations of Sentences and Documents. In International Conference on Machine Learning, 2014, pp. 1188-1196.
  •  17. Butler R. The ICD-10 General Equivalence Mappings. Bridging the translation gap from ICD-9. J AHIMA. 2007 Oct;78(9):84–5.
  •  18. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed Representations of Words and Phrases and their Compositionality. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ, editors. Advances in Neural Information Processing Systems 26. Curran Associates, Inc.; 2013. p. 3111–9.
  •  19. Diehr P, Yanez D, Ash A, Hornbrook M, Lin DY. Methods for Analyzing Health Care Utilization and Costs. Annu Rev Public Health. 1999;20(1):125–44.
  •  20. Perzynski AT, Rothberg MB, Dawson NV, Coulton CJ, Dalton JE. Accuracy of Cardiovascular Risk Prediction Varies by Neighborhood Socioeconomic Position. Annals of Internal Medicine. 2018;168(9):681–682.
  •  21. Berry JG, Hall M, Neff J, Goodman D, Cohen E, Agrawal R, Kuo D, Feudtner C. Children with medical complexity and Medicaid: spending and cost savings. Health Affairs. 2014;33(12):2199–2206.
  •  22. Agrawal R, Hall M, Cohen E, Goodman DM, Kuo DZ, Neff JM, O’Neill M, Thomson J, Berry JG. Trends in health care spending for children in Medicaid with high resource use. Pediatrics. 2016:e20160682.
  •  23. Kravitz RL, Duan N, Braslow J. Evidence based medicine, heterogeneity of treatment effects, and the trouble with averages. The Milbank Quarterly. 2004;82(4):661–687.
  •  24. Hayward RA, Kent DM, Vijan S, Hofer TP. Reporting clinical trial results to inform providers, payers, and consumers. Health Affairs. 2005;24(6):1571–1581.