Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record
The wide implementation of electronic health record (EHR) systems facilitates the collection of large-scale health data from real clinical settings. Despite the significant increase in adoption of EHR systems, this data remains largely unexplored, but presents a rich data source for knowledge discovery from patient health histories in tasks such as understanding disease correlations and predicting health outcomes. However, the heterogeneity, sparsity, noise, and bias in this data present many complex challenges. This complexity makes it difficult to translate potentially relevant information into machine learning algorithms. In this paper, we propose a computational framework, Patient2Vec, to learn an interpretable deep representation of longitudinal EHR data which is personalized for each patient. To evaluate this approach, we apply it to the prediction of future hospitalizations using real EHR data and compare its predictive performance with baseline methods. Patient2Vec produces a vector space with meaningful structure, and it achieves an AUC of around 0.799, outperforming baseline methods. Finally, the learned feature importance can be visualized and interpreted at both the individual and population levels to bring clinical insights.
Longitudinal EHR data resemble text documents from many perspectives. A text document consists of a sequence of sentences, and a sentence is a sequence of words. Similarly, the longitudinal health record of a patient consists of a sequence of visits, and there is a list of clinical events, including diagnoses, medications, and procedures, that occur during a visit. Considering these similarities, representation learning methods for text documents in Natural Language Processing (NLP) have great potential to be applied to longitudinal EHR data.
Deep neural networks have become very popular in the NLP field and have been very successful in many applications, such as machine translation, question answering, text classification, document summarization, language modeling, etc.[1, 2, 3, 4, 5, 6, 7, 8]. These networks excel at complex language tasks because they are capable of identifying high-order relationships, the network structure can encode language structures, and they allow the learning of a hierarchical representation of the language, i.e., representations for tokens, phrases, and sentences, etc.
In the medical domain, it is critical that analytical results are interpretable, so that they can be understood and validated by a human with expert knowledge and so that knowledge captured by analysis can be used for process improvement. Traditional deep neural networks have the disadvantage that they lack interpretability. A substantial amount of work is ongoing to make sense of the “black box”, and the attention mechanism is one of the more effective methods recently developed to make the output of these algorithms more interpretable.
Health care is undergoing unprecedented change, and there is a great potential and demand for personalized care strategies. Personalized medicine, also called precision medicine, has previously focused on optimizing therapy to better fit the genetic makeup of the patient or the disease (e.g., the genetic susceptibility of cancer to specific chemotherapy strategies). The availability of EHR data and advances in machine learning create the potential for another type of personalization of healthcare. This type of personalization has become ubiquitous in our daily life. For example, customers have come to expect personalized search on Google and personalized product recommendations on Amazon and Netflix, based on their characteristics and previous experiences with the systems. Personalization of healthcare processes, based on a patient’s phenotype (physical and medical characteristics) and healthcare experiences as documented in the health record, may also improve “customer” satisfaction, and it has the additional potential to improve healthcare efficiency, lower costs, and yield better outcomes. We believe that representation learning methods can capture a personalized representation of the important heterogeneities in patients’ phenotypes and medical histories at the population level, and make these representations available to drive healthcare decisions and strategies.
This research is based on RNN models and the attention mechanism with the objective of learning a personalized, interpretable, and complete representation of patients’ medical records. Our proposed framework is capable of learning a personalized representation for each patient from a sequence of clinical events. A hierarchical attention mechanism learns personalized weights of clinical events, including hospital visits and the procedures that they contain. These weights allow us to interpret the relative importance and roles of clinical events in the learned representations both at individual and population levels. The ultimate goal is more accurate prediction and better insight into the critical elements of healthcare processes that can be used to improve healthcare delivery.
The rest of this paper is organized as follows: Section II summarizes the variants of RNNs and the attention mechanism, as well as their application to EHR data. Section III presents an overview of the proposed Patient2Vec representation learning framework, and Section IV elaborates the details of the algorithms. In Section V, the proposed framework is evaluated for a prediction task and we compare its performance with other baseline methods. In addition to prediction performance, we further interpret the learned representations with visualizations on example patients and events. Finally, Section VI provides a summary of this work.
In this section, we present an overview of a gated recurrent unit, a type of RNN, which is capable of capturing long-term dependencies. Then we briefly introduce attention mechanisms in neural networks that allow the network to attend to certain regions of data, which is inspired by the visual attention mechanism in humans. Additionally, we summarize the RNN networks and attention mechanisms previously used to mine EHR data.
RNNs are expected to learn long-term dependencies by taking the previous state and the new input in the computation at the current time step. However, vanilla RNNs are incapable of capturing the dependencies when the sequence is very long due to the vanishing gradient problem. Many variants of the RNN network have been proposed to address this issue, and long short-term memory (LSTM) is one of the most popular models used nowadays in NLP tasks [13, 14, 7, 8, 15, 16].
[Figure II-A: The top figure is a GRU gating unit and the bottom figure shows an LSTM unit.]
GRU is a simplified version of LSTM. The basic idea of GRU is to combat the vanishing gradient problem with a gating mechanism. Hence the general recurrent structure in GRU is identical to vanilla RNNs except that a GRU unit is used in the computation at each time step rather than a traditional simple recurrent unit.
In general, a GRU cell has two gates, i.e., a reset gate and an update gate. The reset gate is used to determine how to integrate the previous state into the computation of the current state, while the update gate determines how much the unit updates its activation.
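Given the input $x_t$ at time step $t$, the reset gate $r_t$ is computed as presented in Equation 1
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{1}$$
where $\sigma$ is the logistic sigmoid, $W_r$ and $U_r$ are the weight matrices of the reset gate, and $h_{t-1}$ is the hidden activation at time step $t-1$. A similar computation is performed for the update gate $z_t$ at time step $t$, shown in Equation 2
$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{2}$$
where $W_z$ and $U_z$ are the weight matrices of the update gate. The current hidden activation $h_t$ is computed by
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{3}$$
where $\tilde{h}_t$ is the candidate activation at time step $t$. The computation of $\tilde{h}_t$ is presented in Equation 4
$$\tilde{h}_t = \tanh\left(W x_t + U (r_t \odot h_{t-1})\right) \tag{4}$$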
GRU is capable of learning long-term dependencies due to the additive component of the update from $t$ to $t+1$ in the gating mechanism. Consequently, important features will be carried forward in the input stream while irrelevant information will be dropped. When the reset gate is close to 0, the network is forced to drop previous states and reset with current information. Moreover, the method provides shortcuts such that the error is easily backpropagated without vanishing too quickly [5, 18]. Hence, the GRU is well-suited to learn long-term dependencies in sequence data.
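The gating computation described above can be illustrated with a minimal sketch of a single GRU step. It assumes the standard formulation in which the update gate weights the candidate activation, omits bias terms, and uses plain Python lists; the function and parameter names are illustrative and are not taken from the original implementation.

```python
import math


def _matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def _sigmoid(values):
    """Element-wise logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W, U):
    """One GRU time step (biases omitted for brevity)."""
    # Reset gate: how much of the previous state enters the candidate.
    r = _sigmoid([a + b for a, b in zip(_matvec(W_r, x_t), _matvec(U_r, h_prev))])
    # Update gate: how much the unit updates its activation.
    z = _sigmoid([a + b for a, b in zip(_matvec(W_z, x_t), _matvec(U_z, h_prev))])
    # Candidate activation from the input and the reset-gated previous state.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(_matvec(W, x_t), _matvec(U, gated))]
    # Additive update: interpolate between the previous state and the candidate.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]
```

With all weights set to zero, both gates equal 0.5 and the candidate is zero, so the step simply halves the previous state; this makes the additive interpolation easy to check by hand.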
An LSTM unit is similar to a GRU, but has one more gate (as shown in Figure II-A). LSTM also preserves long-term dependencies more effectively than a basic RNN, which is particularly useful for overcoming the vanishing gradient problem. Although LSTM has a chain-like structure similar to an RNN, it uses multiple gates to carefully regulate the amount of information that will be allowed into each node state. Figure II-A shows the basic cell of an LSTM model. A step-by-step explanation of an LSTM cell is as follows:
Input gate activation:
$$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i) \tag{5}$$
Candidate memory cell value:
$$\tilde{C}_t = \tanh(W_c [x_t, h_{t-1}] + b_c) \tag{6}$$
Forget gate activation:
$$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f) \tag{7}$$
New memory cell value:
$$C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1} \tag{8}$$
Output gate value:
$$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o) \tag{9}$$
$$h_t = o_t \odot \tanh(C_t) \tag{10}$$
In the above description, all $b$ represent bias vectors, all $W$ represent weight matrices, and $x_t$ is used as input to the memory cell at time $t$. Also, the indices $i$, $c$, $f$, $o$ refer to the input, cell memory, forget, and output gates, respectively. An RNN can be biased when later words are more influential than the earlier ones.
Empirically, LSTM and GRU achieve comparable performance in many tasks, but there are fewer parameters in a GRU, which makes it a little faster to learn and able to generalize with less data.
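The difference in parameter count can be made concrete with a short sketch. It assumes one input weight matrix, one recurrent weight matrix, and one bias vector per gate block, with three blocks for a GRU (reset, update, candidate) and four for an LSTM (input, forget, output, candidate). Specific libraries may count biases differently, so the numbers are illustrative only.

```python
def recurrent_parameter_count(input_size, hidden_size, gate_blocks):
    """Parameters of a gated recurrent layer under the stated assumptions."""
    # Each block has an input matrix, a recurrent matrix, and a bias vector.
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return gate_blocks * per_block


def gru_parameter_count(input_size, hidden_size):
    """GRU: reset gate, update gate, and candidate activation."""
    return recurrent_parameter_count(input_size, hidden_size, 3)


def lstm_parameter_count(input_size, hidden_size):
    """LSTM: input, forget, and output gates plus the candidate cell value."""
    return recurrent_parameter_count(input_size, hidden_size, 4)
```

Under these assumptions a GRU always has three quarters of the parameters of an LSTM with the same input and hidden sizes.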
Attention mechanisms, inspired by the visual attention system found in humans, have become popular in deep learning. Attention allows the network to focus on certain regions of data, while perceiving other regions with “low resolution”. In addition to higher accuracy, it also facilitates the interpretation of learned representations. We elaborate an attention mechanism on an RNN network, and Figure II-B presents a graphical illustration.
[Figure II-B: The global attention model.]
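According to Figure II-B, a variable-length weight vector $\alpha$ is learned based on the hidden states $h_1, \dots, h_T$. Then a global context vector $c$ is computed based on the weights $\alpha$ and all the hidden states to create the final output. Equation 11 presents the computation of the weight vector $\alpha$, where $T$ is the length of the sequence
$$\alpha = f(h_1, h_2, \dots, h_T), \quad \alpha \in \mathbb{R}^{T} \tag{11}$$
and $f$ is a nonlinear activation function, usually softmax or tanh. Then, the context vector is constructed as:
$$c = \sum_{t=1}^{T} \alpha_t h_t \tag{12}$$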
Thus, the network puts more attention on the features that are important for the final prediction, which can improve the model performance. An additional benefit is that the weights can be utilized to understand the importance of features, so that the models are more interpretable. The attention mechanism has been introduced to both Convolutional Neural Networks (CNNs) and RNNs for various tasks and has achieved many successes in the fields of computer vision and NLP [11, 21, 22].
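As a rough illustration of global attention, the sketch below scores each hidden state, normalizes the scores with a softmax, and forms the context vector as the weighted sum of hidden states. The scoring rule (tanh of a dot product with a learned vector, here called `score_vector`) is an assumption chosen for simplicity, not necessarily the scoring function used in this work.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(hidden_states, score_vector):
    """Return (weights, context) for a list of hidden-state vectors."""
    # Score each hidden state with a hypothetical learned vector.
    scores = [
        math.tanh(sum(w * h for w, h in zip(score_vector, h_t)))
        for h_t in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of the hidden states.
    context = [
        sum(a * h_t[k] for a, h_t in zip(weights, hidden_states))
        for k in range(dim)
    ]
    return weights, context
```

The returned weights sum to one, so they can be read directly as the relative importance of each time step.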
Previous studies on EHR data mainly use statistical methods or traditional machine learning techniques. Recently researchers have started adapting deep learning approaches to this data [23, 24], including textual notes, temporal measurements of laboratory testing in the Intensive Care Unit (ICU), and longitudinal data in patient populations. Here, we summarize deep learning research in mining EHR data and focus on the studies using RNN-based models.
Hospitalized patients, especially patients in ICUs, are continuously monitored for cardiac, respiratory, and other physical functions, creating a large volume of sequential data in multiple dimensions. These measurements are utilized by physicians to make diagnostic and treatment decisions. The functions monitored may change over time and monitoring may be irregular, based on a patient’s condition. It is very challenging for traditional machine learning methods to mine this multivariate time series data considering missing values, varying length, and irregular, non-simultaneous sampling. Lipton et al. trained an LSTM with a replicated target to learn from these sequence data and used this model to make predictions of diagnoses. The data used in this research are time series of clinical measurements with continuous values, and the LSTM models outperformed logistic regression and MLP. Che et al. developed a GRU-based model to address missing values in multivariate time series data, in which the missing patterns are incorporated for improved prediction performance. This work has been applied to the Medical Information Mart for Intensive Care III (MIMIC-III) clinical database to demonstrate its effectiveness in mining time series of clinical measurements with missing values. Longitudinal EHR data including clinical events, such as diagnoses, medications, and procedures, is also a potentially rich resource for predictive modeling. Choi et al. analyze this data with a GRU network to forecast future clinical events, and it achieves a better prediction performance than comparison models such as logistic regression and MLP.
Difficulty in interpreting model behavior is one of the major drawbacks of using deep learning to mine EHR data. Some attempts have been made to address this issue. Che et al. propose an interpretable mimic learning method which trains a mimic gradient boosting trees model to utilize predicted labels or features learned by deep learning models for final prediction. Then the feature importances learned by the tree-based models are used for knowledge discovery. Attention mechanisms have been introduced recently to improve the interpretability of the prediction results of deep learning models in health analytics. Choi et al. develop an interpretable model with two levels of attention weights learned from two reverse-time GRU models, respectively. The experimental results on EHR data indicate comparable prediction performance with conventional GRU models but more interpretable results. Our work continues the attempt to use attention mechanisms to improve the interpretability of RNN-based models.
In this section, we provide an overview of the proposed hierarchical representation learning framework. This framework uses deep recurrent neural networks to capture the complex relationships between clinical events in the patient’s EHR data and employs the attention mechanism to learn a personalized representation and to obtain relative feature importance. The proposed representation learning framework contains four steps and is presented graphically in Figure III.
[Figure III: The Patient2Vec representation learning framework.]
EHR data consists primarily of records of outpatient and inpatient visits to healthcare providers. These visit records include multiple clinical codes for diagnoses, symptoms, procedures, therapies, and other observations and events that occurred during the visit. Here, we treat the set of medical codes associated with a visit as a sentence consisting of words, except that there is no ordering in the words. Thus, we adopt the word2vec approach to construct a vector to represent each medical code.
Clinical visits are represented as the set of vectors for the codes associated with the visit. Because closely-spaced visits are usually related clinically, we employ a time window to split the sequence of visits into multiple subsequences of equal length. A subsequence might contain multiple visits if they occurred within the same time window, or there might be no visits during a particular time window, yielding an empty subsequence. Thus we transform the original sequence of irregularly-spaced visits into a sequence of subsequences with equal intervals, which is preferable for recurrent neural networks. The width of the subsequence window defines the time granularity of the method, and its optimal width is related to the acuity (i.e., stability) of the clinical characteristics involved in the prediction task. In future work it may be possible to define the relationship between clinical acuity and optimal subsequence width, or develop methods for learning an optimal width for a defined prediction task.
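The windowing step can be sketched as follows. The sketch assumes each visit is given as a pair of a day offset from the start of the observation window and a list of medical codes, and that codes are deduplicated within a window; these choices, the function name, and the example codes are illustrative assumptions rather than details from the original pipeline.

```python
def split_into_subsequences(visits, window_days, observation_days):
    """Group irregularly spaced visits into equal-width time windows.

    visits: iterable of (day_offset, codes) pairs (hypothetical input layout).
    Returns one sorted list of distinct codes per window; a window with no
    visits yields an empty list.
    """
    n_windows = -(-observation_days // window_days)  # ceiling division
    buckets = [set() for _ in range(n_windows)]
    for day, codes in visits:
        # Ignore visits that fall outside the observation window.
        if 0 <= day < observation_days:
            buckets[day // window_days].update(codes)
    return [sorted(bucket) for bucket in buckets]
```

For example, with a 90-day window over a 360-day observation period, two visits on days 3 and 10 fall into the first subsequence, while a window without visits is kept as an empty subsequence so that the sequence stays evenly spaced.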
Because all medical events occurring within a subsequence are unlikely to contribute equally to the prediction of the target outcome, we cannot aggregate them with equal weights. Instead, we employ a self-attention mechanism which trains the network to learn the weights.
Given a sequence of subsequences with embedded medical codes, we are able to input it into a recurrent neural network to capture the temporal dependencies between events. However, the subsequences of visits are not contributing equally to the outcome. Hence, we employ another level of attention to learn the weights of the subsequences by the network itself for the outcome prediction.
Given the learned weights and hidden outputs, we aggregate them into one universal vector for a comprehensive representation. In this step, static information, such as age, gender, and previous hospitalization history, is added as extra features to obtain a complete representation of a patient.
Given the complete vector representation of a patient’s EHR data, we add a logistic regression layer at the end for the prediction of outcome.
In this section, we present the details of the proposed representation learning framework, which is based on a GRU network and a hierarchical attention mechanism. Figure 1 presents the structure of the proposed network with attention.
The proposed framework consists of five parts presented in the following: I) Learning vector representations of medical codes, II) Learning within-subsequence self-attention, III) Learning subsequence-level self-attention, IV) Constructing aggregated deep representation, V) Predicting outcome.
Given a patient’s raw EHR data, a sequence of visits, we observe that a visit usually contains multiple medical codes. Hence, it is feasible to learn a vector to represent each medical code by capturing the relationships between the codes. In this work, we employ the classical word2vec algorithm, skip-gram. The basic idea of skip-gram is to learn a vector to represent each word such that the probability of predicting the context words given the target word is maximized. Hence, the vectors of similar words are close to each other in the learned feature space. In the skip-gram model, the vectors are learned by training a shallow neural network to predict the context words given an input word. Similarly, in our problem, the input is a medical code and the targets to predict are the medical codes that occurred in the same visit.
Hence, each subsequence is a matrix consisting of the vectors of the medical codes that occurred during the associated time window.
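Because the codes in a visit are unordered, every other code in the same visit can serve as context for a given code. The sketch below generates such (target, context) training pairs for one visit; treating the whole visit as the context window and deduplicating codes are assumptions made here for illustration, and the codes in the usage note are arbitrary examples.

```python
from itertools import permutations


def skipgram_pairs(visit_codes):
    """All ordered (target, context) pairs of distinct codes in one visit."""
    # Sorting makes the output deterministic; order carries no meaning.
    distinct = sorted(set(visit_codes))
    return list(permutations(distinct, 2))
```

For a visit with codes `["I10", "E11", "I10"]`, this yields `("E11", "I10")` and `("I10", "E11")`, which would then be fed to a skip-gram trainer.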
Given a sequence of subsequences encoded by vectors of medical codes, this step employs the within-subsequence attention which allows the network itself to learn the weights of vectors in the subsequence according to its contribution to the prediction target.
Here, we denote the sequence of patient $p$ as $s^{(p)}$, and $s_i^{(p)}$ denotes the $i$th subsequence in sequence $s^{(p)}$, where $i \in \{1, \dots, t\}$. Thus, $s^{(p)} = \{s_1^{(p)}, s_2^{(p)}, \dots, s_t^{(p)}\}$. To simplify the notation, we omit $p$ in the following explanation. Subsequence $s_i$ is a matrix of medical codes such that $s_i = \{x_{i1}, x_{i2}, \dots, x_{in}\}$, where $x_{ij}$ is the vector representation of the $j$th medical code in the $i$th subsequence and there are $n$ medical codes in a subsequence. In real EHR data, it is very likely that the numbers of medical codes in each visit or time window are different; thus, we utilize padding to obtain a consistent matrix dimensionality in the network.
To assign attention weights, we utilize a one-side convolution operation with a filter and a nonlinear activation function. Thus, the weight vector $\alpha_i$ is generated for the medical codes in the subsequence $s_i$, presented in Equation 13.
$$\alpha_{ij} = f(\mathrm{Conv}(x_{ij})) \tag{13}$$
where $j \in \{1, \dots, n\}$, $f$ is the nonlinear activation function, and $\omega$ is the weight vector of the filter. The convolution operation is presented in Equation 14.
$$\mathrm{Conv}(x_{ij}) = \omega^{\top} x_{ij} + b \tag{14}$$
where $b$ is a bias term. Then, given the original matrix $s_i$ and the learned weights $\alpha_i$, an aggregated vector $v_i$ is constructed to represent the $i$th subsequence, presented in Equation 15.
$$v_i = \sum_{j=1}^{n} \alpha_{ij} x_{ij} \tag{15}$$
Given Equation 15, we obtain a sequence of vectors, $v_1, v_2, \dots, v_t$, to represent a patient’s medical history.
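A minimal sketch of the within-subsequence attention is given below. It assumes the one-side convolution reduces to a dot product between a single filter and each code vector plus a bias, and that the nonlinear activation is tanh; both are assumptions for illustration, and the names `filter_weights` and `bias` are hypothetical.

```python
import math


def within_subsequence_attention(code_vectors, filter_weights, bias=0.0):
    """Weight each medical-code vector and aggregate the subsequence.

    code_vectors: list of equal-length vectors, one per medical code
    (zero vectors can serve as padding). Returns (weights, aggregated).
    """
    # One attention weight per code: activation of the filter response.
    weights = [
        math.tanh(sum(w * x for w, x in zip(filter_weights, vec)) + bias)
        for vec in code_vectors
    ]
    dim = len(filter_weights)
    # Aggregated vector: weighted sum of the code vectors.
    aggregated = [
        sum(a * vec[k] for a, vec in zip(weights, code_vectors))
        for k in range(dim)
    ]
    return weights, aggregated
```

Zero-padded rows contribute nothing to the aggregated vector regardless of their weight, which is one reason padding is a convenient way to equalize subsequence sizes.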
Given a sequence of embedded subsequences, this step employs the subsequence-level attention which allows the network itself to learn the weights of subsequences according to their contribution to the prediction target.
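To capture the longitudinal dependencies, we utilize a bidirectional GRU-based RNN, presented in Equation 16.
$$h_1, h_2, \dots, h_t = \mathrm{GRU}(v_1, v_2, \dots, v_t) \tag{16}$$
where $h_i$ represents the output of the GRU unit at the $i$th subsequence. Then, we introduce a set of linear and softmax layers to generate $M$ hops of weights for the subsequences. For hop $m$,
$$\gamma_{mi} = w_m^{\top} h_i + b_m \tag{17}$$
$$\beta_{mi} = \frac{\exp(\gamma_{mi})}{\sum_{k=1}^{t} \exp(\gamma_{mk})} \tag{18}$$
where $i \in \{1, \dots, t\}$ and $m \in \{1, \dots, M\}$. Thus, with the subsequence-level weights and hidden outputs, we construct a vector $c_m$ to represent a patient’s medical visit history with one hop of subsequence weights, presented in Equation 19.
$$c_m = \sum_{i=1}^{t} \beta_{mi} h_i \tag{19}$$
Then, a context vector $c$ is constructed by concatenating $c_1$, $c_2$, $\dots$, $c_M$.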
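The multi-hop subsequence-level attention can be sketched as below. Each hop applies its own linear layer to every hidden output, normalizes the scores with a softmax over subsequences, and produces one weighted sum; the per-hop vectors are then concatenated. The hidden outputs are taken as given (the GRU itself is not implemented here), and the parameter names are illustrative assumptions.

```python
import math


def _softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_hop_attention(hidden_outputs, hop_weights, hop_biases):
    """Return (weights_per_hop, context) for a list of hidden-output vectors.

    hop_weights: one weight vector per hop; hop_biases: one scalar per hop.
    """
    dim = len(hidden_outputs[0])
    weights_per_hop = []
    context = []
    for w, b in zip(hop_weights, hop_biases):
        # Linear layer followed by a softmax over the subsequences.
        scores = [sum(wk * hk for wk, hk in zip(w, h)) + b for h in hidden_outputs]
        beta = _softmax(scores)
        weights_per_hop.append(beta)
        # One weighted sum per hop, appended to the concatenated context.
        context.extend(
            sum(bi * h[k] for bi, h in zip(beta, hidden_outputs))
            for k in range(dim)
        )
    return weights_per_hop, context
```

With `M` hops and hidden outputs of size `d`, the concatenated context has `M * d` entries, to which static patient characteristics can later be appended.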
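Given the context vector $c$, this step integrates the patient’s characteristics into the context vector for a complete vector representation of the patient’s EHR data. In this research, the patient characteristics include demographic information and some static medical conditions, such as age, gender, and previous hospitalization. Thus, an aggregated vector $\hat{c}$ is constructed by adding the patient characteristics $a$ as additional dimensions to the context vector $c$, i.e., $\hat{c} = [c, a]$.
Given the vector representation of the complete medical history and characteristics of patients, $\hat{c}$, we add a linear and a softmax layer for the final outcome prediction, as presented in Equation 20.
$$\hat{y} = \mathrm{softmax}(W_c \hat{c} + b_c) \tag{20}$$
The model is trained by minimizing the cross-entropy loss with a penalty term, presented in Equation 21.
$$L = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log \hat{y}_n + (1 - y_n) \log (1 - \hat{y}_n) \right] + \left\lVert \beta \beta^{\top} - I \right\rVert_F^2 \tag{21}$$
where $N$ is the total number of observations and $\beta$ is the matrix of subsequence-level attention weights, with one row per hop and one column per subsequence. Here, $y_n$ is a binary variable in classification problems, while the model output $\hat{y}_n$ is real-valued. The second term in Equation 21, a penalty derived from prior work on multi-hop self-attention, penalizes redundancy if the attention mechanism provides similar subsequence weights for different hops of attention. This penalty term encourages the multiple hops to focus on diverse areas and each hop to focus on a small area.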
Thus, we obtain a final output for the prediction of outcomes and a complete personalized vector representation of the patient’s longitudinal EHR data.
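The two parts of the training objective can be sketched separately. The sketch assumes a binary cross-entropy averaged over observations and a redundancy penalty equal to the squared Frobenius norm of the hop-weight Gram matrix minus the identity; how the two terms are weighted and combined in the original implementation is not stated here, so they are kept as separate functions.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def redundancy_penalty(hop_weights):
    """Squared Frobenius norm of (B B^T - I) for a hops-by-subsequences matrix B."""
    penalty = 0.0
    for a, row_a in enumerate(hop_weights):
        for b, row_b in enumerate(hop_weights):
            gram = sum(u * v for u, v in zip(row_a, row_b))
            target = 1.0 if a == b else 0.0
            penalty += (gram - target) ** 2
    return penalty
```

Two hops that each put all of their weight on a different subsequence incur no penalty, whereas two hops that attend to the same subsequence are penalized, which is the behavior the penalty term is meant to encourage.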
Although health care spending has been a relatively stable share of the Gross Domestic Product (GDP) in the United States since , the costs of hospitalization, the largest single component of health care expenditures, increased by in . Unplanned hospitalization is also distressing and can increase the risk of related adverse events, such as hospital-acquired infections and falls [34, 35]. Approximately hospitalizations in the United Kingdom are unplanned and are potentially avoidable . One important form of unplanned hospitalization is hospital re-admissions within 30 days of discharge, which is financially penalized in the United States. Early interventions targeted to patients at risk of hospitalization could help avoid unplanned admissions, reduce inpatient health care cost and financial penalties for providers, and reduce emergency department congestion .
In this research, we apply our proposed representation learning framework to the risk prediction of future hospitalization. Many studies have been conducted by researchers to predict the risk of -day readmission, or the admission risk of a particular population, such as patients with Ambulatory Care Sensitive Conditions (ACSCs), patients with heart failure, etc. [38, 39, 40, 41]. Here, we focus on the general population and the objective is to predict the risk of all-cause hospitalization using longitudinal EHR data.
In this research, we use de-identified EHR data from the University of Virginia Health System covering months beginning in September . This dataset contains inpatient and outpatient visits of distinct patients. We extracted visit data with diagnosis, medication, and procedure codes.
We defined the observation window and prediction period to validate the proposed method. We first extract all patients with a medical record of at least years, where the first year is the observation window and the medical records in this time window are used for feature construction. The following months is the hold-off period for the purpose of early detection. For the positive class, we take all patients who have hospitalization after the first years in their medical history, while the negative class consists of patients who have no hospitalization after years. To better illustrate the experimental setting, we present the observation window, hold-off and onset of outcome event in Figure 2.
Here, the medical codes include diagnosis, medication, and procedure codes, and a vector representation is learned for each code. In this dataset, diagnoses are primarily coded in ICD- and a small portion is ICD- codes, while procedures are mainly using CPT codes with a few ICD- procedure codes. The codes of medications are using the pharmaceutical categories. Overall, there are distinct medication categories, distinct diagnoses codes, and distinct procedure codes in the EHR data. The dimension of the learned vectors of medical codes is set to . Medical codes that appear in less than patients medical records are excluded as rare events.
To construct the subsequences of medical codes, we use days as the time window. Figure 3 presents the cumulative histogram and density plot of the numbers of visits in the observation window, and we observe that the majority of patients have a small number of visits during the observation window (less than of patients have more than visits). Thus, we set to days, which split the observation window into subsequences.
Within each subsequence, the number of distinct medical codes was computed and patients with more medical codes in a subsequence than the quantile were excluded from the dataset. Overall, there are and patients in the target and control groups, respectively. Each group is randomly split into training, validation and testing sets with a 7:1:2 ratio. Thus, are used for training, another is used for testing, and the rest are used for parameter tuning and early stopping. The stochastic gradient descent algorithm is used in training to minimize the cross-entropy loss function, shown in Equation 21.
To evaluate the proposed representation learning framework, we compare the prediction performance of the proposed model with baseline approaches as follows.
Logistic regression (LR): The inputs are the aggregated counts of grouped medical codes over the entire observation window. Since the dimensionality of raw medical codes is huge, AHRQ clinical classifications of diagnoses and procedures are used to achieve a more general clustering of medical codes. The medication codes are the pharmaceutical classes. Furthermore, patient characteristics and previous inpatient visits are also considered, where age and gender are demographic information, and a binary indicator is utilized to represent the presence of a previous hospitalization. Hence, the input is a -dimensional vector representing a patient’s medical history and characteristics.
Multi-layer perceptron (MLP): A multi-layer perceptron is trained to predict hospitalization using the same inputs as for logistic regression. Here, we use a one-hidden-layer MLP with hidden nodes.
Forward RNN with medical group embedding (FRNN-MGE): We split the sequence into subsequences with equal interval . The input at each step is the counts of medical groups within the associated time interval, and the patient characteristics are appended as additional features in the final logistic regression step. Here, the RNN is a forward GRU (or LSTM) with one hidden layer and the size of the hidden layer is .
Bidirectional RNN with medical group embedding (BiRNN-MGE): The inputs used for this baseline are the same as those for FRNN-MGE. The RNN used here is a bidirectional GRU with one hidden layer and the size of the hidden layer is .
Forward RNN with vector embedding (FRNN-MVE): We split the sequence into subsequences with equal interval . The input at each step is the vector representation of the medical codes within the associated time interval, and the patient characteristics are appended as additional features in the final logistic regression step. Here, the RNN is a forward GRU (or LSTM) with one hidden layer and the size of the hidden layer is .
RETAIN: This model uses a reverse-time attention mechanism on RNNs for an interpretable representation of a patient’s EHR data. The inputs are the same as those for FRNN-MGE, which takes the counts of medical groupings within each time interval to construct features. Similarly, the two RNNs used for generating weights are GRU-based and the size of the hidden layers is .
Patient2Vec: The inputs are the same as those for FRNN-MVE. One filter is used when generating weights for within-subsequence attention, and three filters are used for subsequence-level attention. Similarly, the RNN used here is GRU-based with one hidden layer, and the size of the hidden layer is .
The inputs of all baselines and Patient2Vec are normalized to have zero mean and unit variance. We model the risk of hospitalization based on Patient2Vec and baseline representations of patients’ medical histories, and the model performance is evaluated with Area Under Curve (AUC), sensitivity, specificity, and F2-score. The validation set is used for parameter tuning and early stopping in the training process. Each experiment is repeated times and we calculate the averages and standard deviations of the above metrics, respectively.
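For reference, the threshold-based metrics can be computed from binary labels and predictions as in the sketch below. The F2-score is the F-beta score with beta equal to 2, which weights sensitivity (recall) more heavily than precision. The function name and the dictionary layout are illustrative choices; AUC is omitted because it requires ranked scores rather than thresholded predictions.

```python
def classification_metrics(y_true, y_pred, beta=2.0):
    """Sensitivity, specificity, and F-beta from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    b2 = beta * beta
    denom = b2 * precision + sensitivity
    # F-beta combines precision and recall; beta > 1 favors recall.
    f_beta = (1.0 + b2) * precision * sensitivity / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_beta": f_beta,
    }
```

Emphasizing sensitivity through the F2-score fits a screening setting such as hospitalization risk, where missing an at-risk patient is typically more costly than a false alarm.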
The predictive performance of Patient2Vec and baselines are presented in Table I. The results shown here for the RNN-based models are based on time interval days to construct subsequences.
According to Table I, the RNN-based models are generally capable of achieving higher prediction performance in terms of sensitivity, AUC and F2 score, except for the RNN models based on medical group embedding which have lower sensitivity. Among all RNN-based approaches, the ones based on vector embedding outperform those based on medical group embedding in terms of sensitivity, AUC, and F2 score. The bidirectional RNN models generally have higher specificity but lower sensitivity than the forward RNN models, while the bidirectional ones have comparable AUC and F2 score with the forward ones, respectively. Generally, the proposed Patient2Vec framework outperforms the baseline methods, especially in terms of sensitivity and F2 score.
In addition to predictive performance, we interpret the learned representation by understanding the relative importance of clinical events in a patient’s EHR data. Because the feature importance learned by Patient2Vec is personalized for an individual patient, we illustrate it with two example patients. Figures II and II present the profiles of two individuals, Patient A and Patient B, respectively. To facilitate the interpretation, instead of using raw medical codes, we present the clinical groups from the AHRQ clinical classification software on diagnoses and procedure codes, as well as pharmaceutical groups for medications.
Diagnosis groups:
| No. | Clinical group |
|---|---|
| 2 | Other connective tissue disease |
| 3 | Spondylosis; intervertebral disc disorders; other back problems |
| 4 | Other lower respiratory disease |
| 5 | Disorders of lipid metabolism |
| 7 | Diabetes mellitus without complication |
| 8 | Screening and history of mental health and substance abuse codes |
| 9 | Other nervous system disorders |
| 10 | Other screening for suspected conditions (not mental disorders or infectious disease) |

Procedure groups:
| No. | Clinical group |
|---|---|
| 1 | Other OR therapeutic procedures on nose; mouth and pharynx |
| 2 | Suture of skin and subcutaneous tissue |
| 3 | Other therapeutic procedures on eyelids; conjunctiva; cornea |
| 4 | Laboratory - Chemistry and hematology |
| 6 | Other OR therapeutic procedures of urinary tract |
| 7 | Other OR procedures on vessels other than head and neck |
| 8 | Therapeutic radiology for cancer treatment |
Figure 4: The profile of Patient A.
Figure 5: The profile of Patient B.
According to Figure 4, Patient A is a male patient who has a hospitalization history in the observation window and is admitted to the hospital seven months after the end of the observation window for congestive heart failure. The predicted risk is , while the risk decreases for female patients or patients without hospitalization history. It is also not surprising to observe an increased risk for older patients. The heat map in Figure 4 shows the relative importance of the medical events in this patient’s medical record at each time window; the first row of the heat map presents the subsequence-level attention. A darker color indicates a stronger correlation between the clinical events and the outcome. Accordingly, we observe that the last subsequence, t4, is the most important with respect to hospitalization risk, followed by , , and in order of importance.
Among all the clinical events in the subsequence , we observe that the OR therapeutic procedures (nose, mouth, and pharynx), laboratory (chemistry and hematology), coronary atherosclerosis & other heart disease, cardiac dysrhythmias, and conduction disorders are the ones with the highest weights, while other events such as other connective tissue disease are less important in terms of future hospitalization risk. Additionally, some medications appear to be informative as well, including beta blockers, antihypertensives, anticonvulsants, and anticoagulants. In the first time window, the medical events with high weights are coronary atherosclerosis & other heart disease, gastrointestinal hemorrhage, deficiency and anemia, and other aftercare. In the next subsequence, the most important medical events are heart diseases and related procedures such as coronary atherosclerosis & other heart disease, cardiac dysrhythmias, conduction disorders, hypertension with complications, other OR heart procedures, and other OR therapeutic nervous system procedures. We also observe that kidney-disease-related diagnoses and procedures appear to be important features. Throughout the observation window, coronary atherosclerosis & other heart disease, cardiac dysrhythmias, and conduction disorders consistently show high weights with respect to hospitalization risk, and the findings are consistent with the medical literature.
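To make the two levels of the heat map concrete, the sketch below shows one simplified way attention weights can be normalized and ranked: a softmax over subsequence-level scores, and a separate softmax over event-level scores within a subsequence. This is an illustration under stated assumptions, not Patient2Vec's actual scoring functions; the raw scores, the dictionary names, and the helper `rank_by_weight` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax; returns weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_by_weight(raw_scores):
    """Turn a {name: raw score} dict into (name, weight) pairs, highest first."""
    names = list(raw_scores)
    weights = softmax([raw_scores[n] for n in names])
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical raw attention scores for four subsequences (time windows).
subsequence_scores = {"t1": 0.2, "t2": 0.9, "t3": 0.5, "t4": 1.7}

# Hypothetical raw scores for events inside the last subsequence.
event_scores_t4 = {
    "Coronary atherosclerosis & other heart disease": 2.1,
    "Cardiac dysrhythmias": 1.8,
    "Other connective tissue disease": 0.1,
}

subsequence_ranking = rank_by_weight(subsequence_scores)
event_ranking_t4 = rank_by_weight(event_scores_t4)
```

In a heat map such as the ones described here, the first row would correspond to `subsequence_ranking` and each column of cells to the event-level weights of one time window, with darker cells for larger weights.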
Figure 5 presents the profile of Patient B, who is a male patient without hospitalization in the observation window. This patient is hospitalized for occlusion of cerebral arteries approximately one year after the observation window, and the predicted risk is . For a similar patient who is years older or with previous hospitalization history, the risk increases by and , respectively, while there is a smaller risk of hospitalization for a female patient. To illustrate the medical events of Patient B, the heat map in Figure 5 depicts the relative importance of medical groups in the subsequences, as well as the subsequence-level weights for hospitalization risk. As before, a darker color indicates a stronger correlation between the clinical events and the outcome. Accordingly, we observe that the second subsequence appears to be the most important, while the last one is less predictive of future hospitalization. In fact, the medical events in the last time window are spondylosis, intervertebral disc disorders, other back problems; other bone disease & musculoskeletal deformities; and malaise and fatigue, which are not highly related to the cause of hospitalization of Patient B.
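Table III: Top diagnosis groups with high weights in patients hospitalized for different primary causes.
|In patients admitted for osteoarthritis|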
|2||Other connective tissue disease|
|3||Other non-traumatic joint disorders|
|4||Spondylosis; intervertebral disc disorders; other back problems|
|In patients admitted for septicemia|
|2||Diabetes mellitus without complication|
|3||Disorders of lipid metabolism|
|4||Other lower respiratory disease|
|In patients admitted for acute myocardial infarction|
|1||Coronary atherosclerosis and other heart disease|
|3||Other screening for suspected conditions (not mental disorders or infectious disease)|
|4||Other lower respiratory disease|
|5||Disorders of lipid metabolism|
|In patients admitted for congestive heart failure|
|1||Congestive heart failure (nonhypertensive)|
|2||Coronary atherosclerosis and other heart disease|
|4||Diabetes mellitus without complication|
|5||Other lower respiratory disease|
|In patients admitted for diabetes mellitus with complications|
|1||Diabetes mellitus with complications|
|2||Diabetes mellitus without complication|
|4||Other nutritional; endocrine; and metabolic disorders|
|5||Fluid and electrolyte disorders|
In the most predictive subsequence, , we observe that other OR heart procedures; genitourinary symptoms; spondylosis, intervertebral disc disorders, other back problems; therapeutic procedures on eyelid, conjunctiva, and cornea; and arterial blood gases have high attention weights. In the earliest time window, the most important medical events also include therapeutic procedures on eyelid, conjunctiva, and cornea, and arterial blood gases, while diabetes, hypertension, and diagnostic products show relatively high importance. Throughout the observation window, the medical events spondylosis, intervertebral disc disorders, other back problems and therapeutic procedures on eyelid, conjunctiva, and cornea consistently have high attention weights. Here, diagnostic products is a medication class that includes barium sulfate, iohexol, gadopentetate dimeglumine, iodixanol, tuberculin purified protein derivative, regadenoson, acetone (urine), and so forth. These medications are primarily used for blood or urine testing, or as radiopaque contrast agents for x-rays or CT scans for diagnostic purposes.
Additionally, we attempt to interpret the learned representation and feature importance at the population level. In Table II, we present the top clinical groups with high weights among hospitalized patients in the test set.
According to Table II, the most predictive diagnosis groups for future hospitalization are chronic diseases, including essential hypertension, diabetes, lower respiratory disease, and disorders of lipid metabolism, and musculoskeletal diseases such as other connective tissue disease and spondylosis, intervertebral disc disorders, other back problems. The most important procedures are some OR therapeutic procedures and laboratory tests, such as the OR procedures on nose, mouth, and pharynx, vessels, urinary tract, eyelid, conjunctiva, cornea, etc. It is not surprising to see that diagnostic products show high weights, considering these medications are used in testing or examinations for diagnostic purposes.
Moreover, we present the top diagnosis groups with high weights in patients hospitalized for different primary causes. Table III shows the top diagnosis groups with high weights in patients admitted for osteoarthritis, septicemia (except in labor), acute myocardial infarction, congestive heart failure (nonhypertensive), and diabetes mellitus with complications, respectively. Accordingly, we observe that the most important diagnoses for hospitalization risk prediction in the population admitted for osteoarthritis are musculoskeletal diseases such as connective tissue disease, joint disorders, and spondylosis. In contrast, the diagnoses with the highest weights in the patients admitted for septicemia are chronic diseases including essential hypertension, diabetes, disorders of lipid metabolism, and respiratory disease. The top diagnoses overlap substantially between the populations admitted for acute myocardial infarction and for congestive heart failure, as both populations are admitted for heart diseases. Here, the overlapping diagnosis groups include coronary atherosclerosis and other heart disease and other lower respiratory disease. As for patients admitted for diabetes with complications, the top diagnoses are diabetes with or without complications; other nutritional, endocrine, and metabolic disorders; and fluid and electrolyte disorders. In general, the learned feature importance is consistent with the medical literature.
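A population-level ranking of this kind can be sketched as a simple aggregation of per-patient weights grouped by admission cause. The example below assumes each patient is summarized as a dictionary of clinical group to attention weight and that groups are ranked by their mean weight across patients with the same cause; both the data layout and the choice of the mean are assumptions for illustration, since the text does not specify the aggregation rule. The weights shown are made up.

```python
from collections import defaultdict


def top_groups_by_cause(patients, k=5):
    """Rank clinical groups by mean weight within each admission cause.

    A group that is absent from a patient's record contributes a weight of 0.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for patient in patients:
        cause = patient["cause"]
        counts[cause] += 1
        cause_sums = sums[cause]  # ensure the cause appears even with no events
        for group, weight in patient["weights"].items():
            cause_sums[group] += weight

    ranking = {}
    for cause, group_sums in sums.items():
        means = {g: s / counts[cause] for g, s in group_sums.items()}
        ordered = sorted(means.items(), key=lambda item: item[1], reverse=True)
        ranking[cause] = ordered[:k]
    return ranking


# Toy patients with hypothetical weights (not values from the paper).
toy_patients = [
    {"cause": "congestive heart failure",
     "weights": {"Congestive heart failure (nonhypertensive)": 0.5,
                 "Diabetes mellitus without complication": 0.2}},
    {"cause": "congestive heart failure",
     "weights": {"Congestive heart failure (nonhypertensive)": 0.3,
                 "Coronary atherosclerosis and other heart disease": 0.4}},
    {"cause": "osteoarthritis",
     "weights": {"Other connective tissue disease": 0.6,
                 "Other non-traumatic joint disorders": 0.1}},
]

ranking = top_groups_by_cause(toy_patients, k=2)
```

Other aggregation rules, such as the median weight or the frequency with which a group appears among a patient's top-weighted events, would be equally plausible and could change the ordering.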
Our proposed framework is applied to the prediction of hospitalization using real EHR data, which demonstrates its prediction accuracy and interpretability. This work could be further enhanced by incorporating follow-up information on the negative patient population and investigating whether it indeed shows an improved health outcome or whether the patients are hospitalized elsewhere. Patient2Vec employs a hierarchical attention mechanism, allowing us to directly interpret the weights of clinical events. In future work, we will extend the attention to incorporate demographic information for a more comprehensive and automatic interpretation.
Although we apply Patient2Vec to the early detection of long-term hospitalization, i.e., at least 6 months after the previous hospitalization, it could be used to predict the risk of 30-day readmission to help prevent unnecessary rehospitalizations.
In this paper, we propose a representation learning framework, Patient2Vec, to learn a personalized interpretable deep representation of EHR data based on recurrent neural networks and the attention mechanism. This work improves the performance of predictive models and deepens the understanding of disease correlations. We apply this framework to the risk prediction of hospitalization using patients’ longitudinal EHR data. The experimental results demonstrate that the proposed Patient2Vec representation achieves more accurate prediction than the baseline approaches. Moreover, the learned feature importance in the representations is interpreted at both the individual and population levels to facilitate clinical insights.
In this work, the proposed Patient2Vec framework is evaluated on the risk prediction of all-cause hospitalization, but in the future it could be applied to predict hospitalization in more specific populations, to other health-related prediction problems, or to domains outside of health.