Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record

10/10/2018 ∙ by Jinghe Zhang, et al. ∙ 2

The wide implementation of electronic health record (EHR) systems facilitates the collection of large-scale health data from real clinical settings. Despite the significant increase in adoption of EHR systems, this data remains largely unexplored, yet it presents a rich data source for knowledge discovery from patient health histories in tasks such as understanding disease correlations and predicting health outcomes. However, the heterogeneity, sparsity, noise, and bias in this data present many complex challenges. This complexity makes it difficult to translate potentially relevant information into machine learning algorithms. In this paper, we propose a computational framework, Patient2Vec, to learn an interpretable deep representation of longitudinal EHR data which is personalized for each patient. To evaluate this approach, we apply it to the prediction of future hospitalizations using real EHR data and compare its predictive performance with baseline methods. Patient2Vec produces a vector space with meaningful structure and achieves an AUC of around 0.799, outperforming baseline methods. Finally, the learned feature importance can be visualized and interpreted at both the individual and population levels to bring clinical insights.




I Introduction

Longitudinal EHR data resemble text documents from many perspectives. A text document consists of a sequence of sentences, and a sentence is a sequence of words. Similarly, the longitudinal health record of a patient consists of a sequence of visits, and there is a list of clinical events, including diagnoses, medications, and procedures, that occur during a visit. Considering these similarities, representation learning methods for text documents in Natural Language Processing (NLP) have great potential to be applied to longitudinal EHR data.

Deep neural networks have become very popular in the NLP field and have been very successful in many applications, such as machine translation, question answering, text classification, document summarization, and language modeling [1, 2, 3, 4, 5, 6, 7, 8]. These networks excel at complex language tasks because they are capable of identifying high-order relationships, the network structure can encode language structures, and they allow the learning of a hierarchical representation of the language, i.e., representations for tokens, phrases, and sentences.

Among a variety of deep learning methods, Recurrent Neural Networks (RNNs) have shown their effectiveness in NLP tasks because they have the ability to capture sequential information [8, 7, 9, 10], which is inherent in human language. Traditional neural networks assume that inputs are independent of each other, while an RNN computes the output based on the current input as well as the “memory” from the previous computation. Although vanilla RNNs are not good at capturing long-term dependencies, many variants have been proposed and shown to be effective in addressing this issue.

In the medical domain, it is critical that analytical results are interpretable, so that they can be understood and validated by a human with expert knowledge and so that knowledge captured by analysis can be used for process improvement. Traditional deep neural networks have the disadvantage that they lack interpretability. A substantial amount of work is ongoing to make sense of the “black box”, and the attention mechanism [11] is one of the more effective methods recently developed to make the output of these algorithms more interpretable.

Health care is undergoing unprecedented change, and there is great potential and demand for personalized care strategies. Personalized medicine, also called precision medicine, has previously focused on optimizing therapy to better fit the genetic makeup of the patient or the disease (e.g., the genetic susceptibility of cancer to specific chemotherapy strategies). The availability of EHR data and advances in machine learning create the potential for another type of personalization of healthcare. This type of personalization has become ubiquitous in our daily life. For example, customers have come to expect personalized search on Google and personalized product recommendations on Amazon and Netflix, based on their characteristics and previous experiences with the systems. Personalization of healthcare processes, based on a patient’s phenotype (physical and medical characteristics) and healthcare experiences as documented in the health record, may also improve “customer” satisfaction, and it has the additional potential to improve healthcare efficiency, lower costs, and yield better outcomes. We believe that representation learning methods can capture a personalized representation of the important heterogeneities in patients’ phenotypes and medical histories at the population-level, and make these representations available to drive healthcare decisions and strategies.

This research is based on RNN models and the attention mechanism with the objective of learning a personalized, interpretable, and complete representation of patients’ medical records. Our proposed framework is capable of learning a personalized representation for each patient from a sequence of clinical events. A hierarchical attention mechanism learns personalized weights of clinical events, including hospital visits and the procedures that they contain. These weights allow us to interpret the relative importance and roles of clinical events in the learned representations both at individual and population levels. The ultimate goal is more accurate prediction and better insight into the critical elements of healthcare processes that can be used to improve healthcare delivery.

The rest of this paper is organized as follows: Section II summarizes the variants of RNNs and the attention mechanism, as well as their application to EHR data. Section III presents an overview of the proposed Patient2Vec representation learning framework, and Section IV elaborates the details of the algorithms. In Section V, the proposed framework is evaluated for a prediction task and we compare its performance with other baseline methods. In addition to prediction performance, we further interpret the learned representations with visualizations on example patients and events. Finally, Section VI provides a summary of this work.

II Related Work

In this section, we present an overview of a gated recurrent unit, a type of RNN, which is capable of capturing long-term dependencies. Then we briefly introduce attention mechanisms in neural networks that allow the network to attend to certain regions of data, which is inspired by the visual attention mechanism in humans. Additionally, we summarize the RNN networks and attention mechanisms previously used to mine EHR data.

II-A Recurrent Neural Networks (RNN)

RNNs are expected to learn long-term dependencies by taking the previous state and the new input in the computation at the current time step. However, vanilla RNNs are incapable of capturing the dependencies when the sequence is very long due to the vanishing gradient problem. Many variants of the RNN network have been proposed to address this issue, and long short-term memory (LSTM) is one of the most popular models used nowadays in NLP tasks [13, 14, 7, 8, 15, 16].

Fig. II-A: The top figure is a GRU gating unit and the bottom figure shows an LSTM unit [7]

II-A1 Gated Recurrent Unit (GRU)

GRU is a simplified version of LSTM [7]. The basic idea of GRU is to combat the vanishing gradient problem with a gating mechanism. Hence the general recurrent structure in GRU is identical to vanilla RNNs except that a GRU unit is used in the computation at each time step rather than a traditional simple recurrent unit.

In general, a GRU cell has two gates, i.e., a reset gate $r$ and an update gate $z$. The reset gate is used to determine how to integrate the previous state into the computation of the current state, while the update gate determines how much the unit updates its activation.

Given the input $x_t$ at time step $t$, the reset gate $r_t$ is computed as presented in Equation 1

$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{1}$$

where $W_r$ and $U_r$ are the weight matrices of the reset gate and $h_{t-1}$ is the hidden activation at time step $t-1$. A similar computation is performed for the update gate $z_t$ at time step $t$, shown in Equation 2

$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{2}$$

where $W_z$ and $U_z$ are the weight matrices of the update gate. The current hidden activation $h_t$ is computed by

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{3}$$

where $\tilde{h}_t$ is the candidate activation at time step $t$. The computation of $\tilde{h}_t$ is presented in Equation 4

$$\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})) \tag{4}$$

where $W$ and $U$ are weight matrices, $\sigma$ is the logistic sigmoid, and $\odot$ represents element-wise multiplication. Figure II-A presents a graphical illustration of the GRU [7] and one unit of LSTM.

GRU is capable of learning long-term dependencies [17] due to the additive component of the update from one time step to the next in the gating mechanism. Consequently, important features will be carried forward in the input stream while irrelevant information will be dropped. When the reset gate is $0$, the network is forced to drop previous states and reset with current information. Moreover, the method provides shortcuts such that the error is easily backpropagated without vanishing too quickly [5, 18]. Hence, the GRU is well-suited to learn long-term dependencies in sequence data.
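The gating computation described above can be illustrated with a minimal pure-Python sketch of a single GRU step. It assumes the commonly used formulation r = sigmoid(W_r x + U_r h), z = sigmoid(W_z x + U_z h), candidate = tanh(W x + U (r * h)), and h_new = (1 - z) * h + z * candidate, with biases omitted. The function names and the parameter dictionary `p` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(v):
    """Element-wise logistic sigmoid of a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in m]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def gru_step(x_t, h_prev, p):
    """One GRU update (assumed standard form, no bias terms)."""
    # Reset gate: how much of the previous state enters the candidate.
    r = sigmoid(add(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev)))
    # Update gate: how much the unit moves toward the candidate.
    z = sigmoid(add(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev)))
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a) for a in add(matvec(p["W"], x_t), matvec(p["U"], gated))]
    # Additive interpolation between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]
```

With all weights set to zero, both gates equal 0.5 and the candidate is zero, so the step simply halves the previous state. This makes the additive interpolation between old state and candidate easy to verify by hand.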

II-A2 Long Short-Term Memory (LSTM)

An LSTM unit is similar to a GRU, but with one more gate (as shown in Figure II-A). LSTM also preserves long-term dependencies more effectively than a basic RNN, which is particularly useful for overcoming the vanishing gradient problem [19]. Although LSTM has a chain-like structure similar to RNN, LSTM uses multiple gates to carefully regulate the amount of information that will be allowed into each node state. Figure II-A shows the basic cell of an LSTM model. A step-by-step explanation of an LSTM cell is as follows:

Input gate:

$$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i) \tag{5}$$

Candidate memory cell value:

$$\tilde{C}_t = \tanh(W_c [x_t, h_{t-1}] + b_c) \tag{6}$$

Forget gate activation:

$$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f) \tag{7}$$

New memory cell value:

$$C_t = i_t \odot \tilde{C}_t + f_t \odot C_{t-1} \tag{8}$$

Output gate value:

$$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o) \tag{9}$$

$$h_t = o_t \odot \tanh(C_t) \tag{10}$$

In the above description, all $b$ represent bias vectors, all $W$ represent weight matrices, and $x_t$ is used as input to the memory cell at time $t$. Also, the $i$, $c$, $f$, $o$ indices refer to the input, cell memory, forget, and output gates, respectively. An RNN can be biased when later words are more influential than the earlier ones.

Empirically, LSTM and GRU achieve comparable performance in many tasks but there are fewer parameters in a GRU, which makes it a little faster to learn and able to generalize with fewer data [20].
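A rough count makes the parameter difference concrete. The sketch below assumes that each gate or candidate block holds one weight matrix over the concatenated input and hidden state plus one bias vector, so that a GRU has three such blocks and an LSTM has four. The function names are hypothetical, and actual library implementations may count biases differently.

```python
def recurrent_param_count(input_dim, hidden_dim, n_blocks, bias=True):
    """Parameters of a gated recurrent layer with n_blocks gate/candidate blocks."""
    per_block = hidden_dim * (input_dim + hidden_dim)
    if bias:
        per_block += hidden_dim
    return n_blocks * per_block


def gru_params(input_dim, hidden_dim):
    # Reset gate, update gate, candidate activation.
    return recurrent_param_count(input_dim, hidden_dim, 3)


def lstm_params(input_dim, hidden_dim):
    # Input gate, forget gate, output gate, candidate memory.
    return recurrent_param_count(input_dim, hidden_dim, 4)


if __name__ == "__main__":
    print(gru_params(100, 64), lstm_params(100, 64))
```

Under these assumptions a GRU always has three quarters of the parameters of an LSTM with the same input and hidden sizes.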

II-B Attention Mechanism

Attention mechanisms, inspired by the visual attention system found in humans, have become popular in deep learning. Attention allows the network to focus on certain regions of data, while perceiving other regions with “low resolution”. In addition to higher accuracy, it also facilitates the interpretation of learned representations. We elaborate an attention mechanism on an RNN network, and Figure II-B presents a graphical illustration.

Fig. II-B: The global attention model

According to Figure II-B, a variable-length weight vector $\alpha$ is learned based on the hidden states $h_1, \ldots, h_T$ [11]. Then a global context vector $c$ is computed based on the weights $\alpha$ and all the hidden states to create the final output. Equation 11 presents the computation of the weight vector $\alpha = (\alpha_1, \ldots, \alpha_T)$, where $T$ is the length of the sequence

$$\alpha = f(h_1, h_2, \ldots, h_T) \tag{11}$$

and where $f$ is a nonlinear activation function, usually softmax or $\tanh$. Then, the context vector is constructed as:

$$c = \sum_{t=1}^{T} \alpha_t h_t \tag{12}$$

Thus, the network puts more attention on the important features for the final prediction, which can improve the model performance. An additional benefit is that the weights can be utilized to understand the importance of features such that the models are more interpretable. The attention mechanism has been introduced to both Convolutional Neural Networks (CNNs) and RNNs for various tasks and has achieved many successes in the fields of computer vision and NLP [11, 21, 22].
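As an illustration of the weighting and aggregation described above, the following sketch scores each hidden state, normalizes the scores into weights, and forms the context vector as the weighted sum of hidden states. The scoring rule (tanh of a dot product with a hypothetical vector `w`, followed by softmax) is an assumption made for this example and is only one of several possible choices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(hidden_states, w):
    """Return (weights, context) for a list of hidden-state vectors."""
    # Assumed scoring: tanh of a learned projection of each hidden state.
    scores = [math.tanh(sum(wi * hi for wi, hi in zip(w, h))) for h in hidden_states]
    alpha = softmax(scores)
    dim = len(hidden_states[0])
    # Context vector: attention-weighted sum of all hidden states.
    context = [sum(a * h[j] for a, h in zip(alpha, hidden_states)) for j in range(dim)]
    return alpha, context
```

Because the weights sum to one, they can be read directly as the relative importance of each time step, which is the interpretability benefit noted above.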

II-C Deep Learning in EHR Data

Previous studies on EHR data mainly use statistical methods or traditional machine learning techniques. Recently researchers have started adapting deep learning approaches to this data [23, 24], including textual notes, temporal measurements of laboratory testing in the Intensive Care Unit (ICU), and longitudinal data in patient populations. Here, we summarize deep learning research in mining EHR data and focus on the studies using RNN-based models.

Hospitalized patients, especially patients in ICUs, are continuously monitored for cardiac, respiratory, and other physical functions, creating a large volume of sequential data in multiple dimensions. These measurements are utilized by physicians to make diagnostic and treatment decisions. The functions monitored may change over time and monitoring may be irregular, based on a patient’s condition. It is very challenging for traditional machine learning methods to mine this multivariate time series data considering missing values, varying length, and irregular, non-simultaneous sampling. Lipton et al. [25] trained an LSTM with a replicated target to learn from these sequence data and used this model to make predictions of diagnoses. The data used in this research are time series of clinical measurements with continuous values, and the LSTM models outperformed logistic regression and MLP.

Che et al. [26] developed a GRU-based model to address missing values in multivariate time series data, in which the missing patterns are incorporated for improved prediction performance. This work has been applied to the Medical Information Mart for Intensive Care III (MIMIC-III) clinical database to demonstrate its effectiveness in mining time series of clinical measurements with missing values [27]. Longitudinal EHR data including clinical events, such as diagnoses, medications, and procedures is also a potentially rich resource for predictive modeling. Choi et al. [28] analyze this data with a GRU network to forecast future clinical events, and it achieves a better prediction performance than comparison models such as logistic regression and MLP.

Difficulty in interpreting model behavior is one of the major drawbacks of using deep learning to mine EHR data. Some attempts have been made to address this issue. Che et al. [29] propose an interpretable mimic learning method which trains a mimic gradient boosting trees model to utilize predicted labels or features learned by deep learning models for final prediction [30]. Then the feature importances learned by the tree-based models are used for knowledge discovery. Attention mechanisms have been introduced recently to improve the interpretability of the prediction results of deep learning models in health analytics. Choi et al. [31] develop an interpretable model with two levels of attention weights learned from two reverse-time GRU models, respectively. The experimental results on EHR data indicate comparable prediction performance with conventional GRU models but more interpretable results. Our work continues the attempt to use attention mechanisms to improve the interpretability of RNN-based models.

III Patient2Vec System Model

In this section, we provide an overview of the proposed hierarchical representation learning framework. This framework uses deep recurrent neural networks to capture the complex relationships between clinical events in the patient’s EHR data and employs the attention mechanism to learn a personalized representation and to obtain relative feature importance. The proposed representation learning framework contains five steps and is presented graphically in Figure III.

Fig. III: The Patient2Vec representation learning framework

III-A Learning vector representations of medical codes

EHR data consists primarily of records of outpatient and inpatient visits to healthcare providers. These visit records include multiple clinical codes for diagnoses, symptoms, procedures, therapies, and other observations and events that occurred during the visit. Here, we treat the set of medical codes associated with a visit as a sentence consisting of words, except that there is no ordering in the words. Thus, we adopt the word2vec approach to construct a vector to represent each medical code.

III-B Learning within-subsequence self-attention

Clinical visits are represented as the set of vectors for the codes associated with the visit. Because closely-spaced visits are usually related clinically, we employ a time window to split the sequence of visits into multiple subsequences of equal length. A subsequence might contain multiple visits if they occurred within the same time window, or there might be no visits during a particular time window, yielding an empty subsequence. Thus we transform the original sequence of irregularly-spaced visits into a sequence of subsequences with equal intervals, which is preferable for recurrent neural networks. The width of the subsequence window defines the time granularity of the method, and its optimal width is related to the acuity (i.e., stability) of the clinical characteristics involved in the prediction task. In future work it may be possible to define the relationship between clinical acuity and optimal subsequence width, or develop methods for learning an optimal width for a defined prediction task.
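The windowing step can be sketched as follows. This is a minimal illustration, assuming that each visit is given as a date together with a list of medical codes; the function name, the code labels, and the window width used in the test are hypothetical rather than taken from the study.

```python
from datetime import date


def split_into_subsequences(visits, start, window_days, n_windows):
    """Bucket irregularly spaced visits into equal-width time windows.

    visits: list of (visit_date, [codes]) pairs.
    Returns a list of n_windows code lists; a window without visits stays empty.
    """
    subsequences = [[] for _ in range(n_windows)]
    for visit_date, codes in visits:
        idx = (visit_date - start).days // window_days
        # Visits outside the observation window are ignored.
        if 0 <= idx < n_windows:
            subsequences[idx].extend(codes)
    return subsequences
```

Visits that fall into the same window are merged into one subsequence, while windows without any visit remain empty and would later be represented by padding.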

Because all medical events occurring within a subsequence are unlikely to contribute equally to the prediction of the target outcome, we cannot aggregate them with equal weights. Instead, we employ a self-attention mechanism which trains the network to learn the weights.

III-C Learning subsequence-level self-attention

Given a sequence of subsequences with embedded medical codes, we are able to input it into a recurrent neural network to capture the temporal dependencies between events. However, the subsequences of visits do not contribute equally to the outcome. Hence, we employ another level of attention so that the network itself learns the weights of the subsequences for the outcome prediction.

III-D Constructing aggregated deep representation

Given the learned weights and hidden outputs, we aggregate them into one universal vector for a comprehensive representation. In this step, static information, such as age, gender, and previous hospitalization history, is added as extra features to get a complete representation of a patient.

III-E Predicting outcome

Given the complete vector representation of a patient’s EHR data, we add a logistic regression layer at the end for the prediction of outcome.

IV Patient2Vec Representation Learning Algorithm

In this section, we present the details of the proposed representation learning framework, which is based on a GRU network and a hierarchical attention mechanism. Figure 1 presents the structure of the proposed network with attention.

Fig. 1: A graphical illustration of the network in the Patient2Vec representation learning framework

The proposed framework consists of five parts presented in the following: I) Learning vector representations of medical codes, II) Learning within-subsequence self-attention, III) Learning subsequence-level self-attention, IV) Constructing aggregated deep representation, V) Predicting outcome.

IV-A Learning vector representations of medical codes

Given a patient’s raw EHR data, a sequence of visits, we observe that a visit usually contains multiple medical codes. Hence, it is feasible to learn a vector to represent each medical code by capturing the relationships between the codes. In this work, we employ the classical word2vec algorithm, skip-gram. The basic idea of skip-gram is to learn a vector to represent each word such that the probability of predicting the context words given the target word is maximized. Hence, the vectors of similar words are close to each other in the learned feature space. In the skip-gram model, the vectors are learned by training a shallow neural network to predict the context words given an input word. Similarly, in our problem, the input is a medical code and the targets to predict are the medical codes that occurred in the same visit.

Hence, each subsequence is a matrix consisting of the vectors of the medical codes that occurred during the associated time window.
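To make the analogy with skip-gram concrete, the sketch below builds the (input code, context code) training pairs from unordered visits: every code in a visit is paired with every other code in the same visit. The function name and the code labels are hypothetical, and the shallow network that would be trained on these pairs is not shown.

```python
from itertools import permutations


def skipgram_pairs(visits):
    """Generate (input_code, context_code) pairs from a list of visits.

    Each visit is an unordered collection of medical codes, so the whole
    visit acts as the context window for each of its codes.
    """
    pairs = []
    for codes in visits:
        unique = sorted(set(codes))  # drop duplicates, fix order for reproducibility
        pairs.extend(permutations(unique, 2))
    return pairs
```

A visit with a single code yields no pairs, and a visit with k distinct codes yields k(k-1) ordered pairs, so codes that frequently co-occur in visits end up with similar vectors.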

IV-B Learning within-subsequence self-attention

Given a sequence of subsequences encoded by vectors of medical codes, this step employs the within-subsequence attention which allows the network itself to learn the weights of vectors in the subsequence according to their contribution to the prediction target.

Here, we denote the sequence of patient $p$ as $s^{(p)}$, and $x_i^{(p)}$ denotes the $i$th subsequence in sequence $s^{(p)}$, where $i \in \{1, \ldots, T\}$. Thus, $s^{(p)} = (x_1^{(p)}, \ldots, x_T^{(p)})$. To simplify the notation, we omit $(p)$ in the following explanation. Subsequence $x_i \in \mathbb{R}^{n \times d}$ is a matrix of medical codes such that $x_i = (x_{i1}, \ldots, x_{in})$, where $x_{ik} \in \mathbb{R}^{d}$ is the vector representation of the $k$th medical code in the $i$th subsequence and there are $n$ medical codes in a subsequence. In real EHR data, it is very likely that the numbers of medical codes in each visit or time window are different; thus, we utilize the padding approach to obtain a consistent matrix dimensionality in the network.

To assign attention weights, we utilize the one-side convolution operation with a filter and a nonlinear activation function. Thus, the weight vector $\alpha_i$ is generated for medical codes in the subsequence $x_i$, presented in Equation 13.

$$\alpha_{ik} = f(\mathrm{Conv}(x_{ik})) \tag{13}$$

where $k \in \{1, \ldots, n\}$, $f$ is the nonlinear activation function, and $\omega \in \mathbb{R}^{d}$ is the weight vector of the filter. The convolution operation is presented in Equation 14.

$$\mathrm{Conv}(x_{ik}) = \omega^{\top} x_{ik} + b \tag{14}$$

where $b$ is a bias term. Then, given the original matrix $x_i$ and the learned weights $\alpha_i$, an aggregated vector $\bar{x}_i$ is constructed to represent the $i$th subsequence, presented in Equation 15.

$$\bar{x}_i = \sum_{k=1}^{n} \alpha_{ik} x_{ik} \tag{15}$$

Given Equation 15, we obtain a sequence of vectors, $(\bar{x}_1, \ldots, \bar{x}_T)$, to represent a patient’s medical history.
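The within-subsequence step can be sketched in a few lines. The example assumes that the filter is a single weight vector applied to each code vector (a dot product plus a bias), that the nonlinear activation is tanh, and that padding uses zero vectors; all three are assumptions made for illustration, as are the function and argument names.

```python
import math


def within_subsequence_attention(codes, omega, b, n_max):
    """Weight and aggregate the code vectors of one subsequence.

    codes: list of code vectors (at most n_max of them).
    omega: filter weight vector; b: scalar bias.
    Returns (weights, aggregated_vector).
    """
    dim = len(omega)
    # Zero-pad so that every subsequence has the same number of rows.
    padded = codes + [[0.0] * dim] * (n_max - len(codes))
    # One attention weight per code: activation of the filter response.
    alpha = [math.tanh(sum(w * v for w, v in zip(omega, vec)) + b) for vec in padded]
    # Weighted sum of code vectors; zero-padded rows contribute nothing.
    agg = [sum(a * vec[j] for a, vec in zip(alpha, padded)) for j in range(dim)]
    return alpha, agg
```

Padded rows may still receive a nonzero weight when the bias is nonzero, but since their vectors are zero they do not change the aggregated representation.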

IV-C Learning subsequence-level self-attention

Given a sequence of embedded subsequences $(\bar{x}_1, \ldots, \bar{x}_T)$, this step employs the subsequence-level attention which allows the network itself to learn the weights of subsequences according to their contribution to the prediction target.

To capture the longitudinal dependencies, we utilize a bidirectional GRU-based RNN, presented in Equation 16.

$$h_1, \ldots, h_T = \mathrm{GRU}(\bar{x}_1, \ldots, \bar{x}_T) \tag{16}$$

where $h_i$ represents the output by the GRU unit at the $i$th subsequence. Then, we introduce a set of linear and softmax layers to generate $M$ hops of weights for subsequences. Then, for the hop $m$

$$\gamma_{mi} = w_m^{\top} h_i + b_m \tag{17}$$

$$\beta_{m1}, \ldots, \beta_{mT} = \mathrm{softmax}(\gamma_{m1}, \ldots, \gamma_{mT}) \tag{18}$$

where $m \in \{1, \ldots, M\}$. Thus, with the subsequence-level weights and hidden outputs, we construct a vector $c_m$ to represent a patient’s medical visit history with one hop of subsequence weights, presented in the following Equation 19.

$$c_m = \sum_{i=1}^{T} \beta_{mi} h_i \tag{19}$$

Then, a context vector $c$ is constructed by concatenating $c_1$, $c_2$, $\ldots$, $c_M$.
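The multi-hop weighting can be sketched as follows, taking the recurrent outputs as given. The example assumes that each hop has its own linear layer (a hypothetical weight vector and scalar bias) followed by a softmax over subsequences, and that the per-hop weighted sums are concatenated. The recurrent network itself is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_hop_context(hidden, hop_weights, hop_biases):
    """Return (per-hop weights, concatenated context vector).

    hidden: list of hidden-output vectors, one per subsequence.
    hop_weights / hop_biases: one weight vector and one bias per hop.
    """
    dim = len(hidden[0])
    betas, context = [], []
    for w, b in zip(hop_weights, hop_biases):
        # Linear layer followed by softmax across subsequences.
        gamma = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in hidden]
        beta = softmax(gamma)
        betas.append(beta)
        # Weighted sum of hidden outputs for this hop, appended to the context.
        context.extend(sum(bt * h[j] for bt, h in zip(beta, hidden)) for j in range(dim))
    return betas, context
```

With M hops and hidden outputs of size d, the concatenated context vector has M times d entries, and each row of the returned weights shows which subsequences a hop attends to.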

IV-D Constructing aggregated deep representation

Given the context vector $c$, this step integrates the patient’s characteristics into the context vector for a complete vector representation of the patient’s EHR data. In this research, the patient characteristics include demographic information and some static medical conditions, such as age, gender, and previous hospitalization. Thus, an aggregated vector $c'$ is constructed by adding the patient characteristics $a$ as additional dimensions to the context vector $c$.

IV-E Predicting outcome

Given the vector representation of the complete medical history and characteristics of patients, $c'$, we add a linear and a softmax layer for the final outcome prediction, as presented in Equation 20.

$$\hat{y} = \mathrm{softmax}(W_c c' + b_c) \tag{20}$$

To train the network, we use cross-entropy as the loss function, presented in Equation 21

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right) + \left\lVert \beta \beta^{\top} - I \right\rVert_F^2 \tag{21}$$

where $N$ is the total number of observations, $\beta$ is the matrix whose rows are the hops of subsequence weights, and $I$ is the identity matrix. Here, $y$ is a binary variable in classification problems, while the model output $\hat{y}$ is real-valued. The second term in Equation 21 is to penalize redundancy if the attention mechanism provides similar subsequence weights for different hops of attention, which is derived from [32]. This penalty term encourages the multiple hops to focus on diverse areas and each hop to focus on a small area.
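The two parts of the loss can be sketched separately. The example assumes a binary cross-entropy averaged over observations and a redundancy penalty of the form used in structured self-attention, namely the squared Frobenius norm of the hop-weight matrix times its transpose minus the identity. The exact form and any weighting coefficient in the original model are not confirmed here, and the function names are hypothetical.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped for stability."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


def redundancy_penalty(betas):
    """Squared Frobenius norm of (B B^T - I), with one row of B per hop."""
    g = len(betas)
    penalty = 0.0
    for a in range(g):
        for b in range(g):
            dot = sum(x * y for x, y in zip(betas[a], betas[b]))
            target = 1.0 if a == b else 0.0
            penalty += (dot - target) ** 2
    return penalty
```

The penalty is zero when every hop puts all its weight on a different single subsequence, and it grows when two hops share the same weights, which matches the stated goal of diverse and focused hops.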

Thus, we obtain a final output for the prediction of outcomes and a complete personalized vector representation of the patient’s longitudinal EHR data.

V Evaluation

V-A Background

Although health care spending has been a relatively stable share of the Gross Domestic Product (GDP) in the United States since , the costs of hospitalization, the largest single component of health care expenditures, increased by  in  [33]. Unplanned hospitalization is also distressing and can increase the risk of related adverse events, such as hospital-acquired infections and falls [34, 35]. Approximately hospitalizations in the United Kingdom are unplanned and are potentially avoidable [36]. One important form of unplanned hospitalization is hospital re-admissions within 30 days of discharge, which is financially penalized in the United States. Early interventions targeted to patients at risk of hospitalization could help avoid unplanned admissions, reduce inpatient health care cost and financial penalties for providers, and reduce emergency department congestion [37].

In this research, we apply our proposed representation learning framework to the risk prediction of future hospitalization. Many studies have been conducted by researchers to predict the risk of -day readmission, or the admission risk of a particular population, such as patients with Ambulatory Care Sensitive Conditions (ACSCs), patients with heart failure, etc. [38, 39, 40, 41]. Here, we focus on the general population and the objective is to predict the risk of all-cause hospitalization using longitudinal EHR data.

V-B Experimental Design

In this research, we use de-identified EHR data from the University of Virginia Health System covering months beginning in September . This dataset contains inpatient and outpatient visits of distinct patients. We extracted visit data with diagnosis, medication, and procedure codes.

We defined the observation window and prediction period to validate the proposed method. We first extract all patients with a medical record of at least  years, where the first year is the observation window and the medical records in this time window are used for feature construction. The following  months is the hold-off period for the purpose of early detection. For the positive class, we take all patients who have hospitalization after the first years in their medical history, while the negative class consists of patients who have no hospitalization after years. To better illustrate the experimental setting, we present the observation window, hold-off and onset of outcome event in Figure 2.

Fig. 2: A graphical illustration of the experimental setting for the risk prediction of hospitalization

Here, the medical codes include diagnosis, medication, and procedure codes, and a vector representation is learned for each code. In this dataset, diagnoses are primarily coded in ICD- and a small portion is ICD- codes, while procedures are mainly using CPT codes with a few ICD- procedure codes. The codes of medications are using the pharmaceutical categories. Overall, there are  distinct medication categories,  distinct diagnoses codes, and  distinct procedure codes in the EHR data. The dimension of the learned vectors of medical codes is set to . Medical codes that appear in less than  patients medical records are excluded as rare events.

To construct the subsequences of medical codes, we use days as the time window. Figure 3 presents the cumulative histogram and density plot of the numbers of visits in the observation window, and we observe that the majority of patients have a small number of visits during the observation window (less than  of patients have more than  visits). Thus, we set  to  days, which split the observation window into  subsequences.

Fig. 3: The cumulative histogram and density plot of patients’ numbers of visits

Within each subsequence, the number of distinct medical codes was computed and patients with more medical codes in a subsequence than the quantile were excluded from the dataset. Overall, there are and patients in the target and control groups, respectively. Each group is randomly split into training, validation and testing sets with a 7:1:2 ratio. Thus, 70% are used for training, another 20% are used for testing, and the remaining 10% are used for parameter tuning and early stopping. The stochastic gradient descent algorithm is used in training to minimize the cross-entropy loss function, shown in Equation 21.
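A per-group random split of this kind can be sketched as follows. The function name, the group labels, and the integer patient identifiers are hypothetical; the sketch only illustrates splitting each group separately with a 7:1:2 ratio so that the class balance is similar in all three sets.

```python
import random


def stratified_split(ids_by_class, ratios=(7, 1, 2), seed=0):
    """Split each class separately into train/valid/test by the given ratios."""
    rng = random.Random(seed)
    splits = {"train": [], "valid": [], "test": []}
    total = sum(ratios)
    for ids in ids_by_class.values():
        ids = list(ids)
        rng.shuffle(ids)
        n_train = len(ids) * ratios[0] // total
        n_valid = len(ids) * ratios[1] // total
        splits["train"] += ids[:n_train]
        splits["valid"] += ids[n_train:n_train + n_valid]
        # Remaining patients (including any rounding remainder) go to test.
        splits["test"] += ids[n_train + n_valid:]
    return splits
```

Fixing the seed makes the split reproducible across repeated experiments, while a different seed per repetition would give independent splits.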


To evaluate the proposed representation learning framework, we compare the prediction performance of the proposed model with baseline approaches as follows.

V-B1 Logistic regression (LR)

The inputs are the aggregated counts of grouped medical codes over the entire observation window. Since the dimensionality of raw medical codes is huge, AHRQ clinical classifications of diagnoses and procedures are used to achieve a more general clustering of medical codes [42]. The medication codes are the pharmaceutical classes. Furthermore, patient characteristics and previous inpatient visit are also considered, where age and gender are demographic information, and a binary indicator is utilized to represent the presence of the previous hospitalization. Hence, the input is a -dimensional vector representing a patient’s medical history and characteristics.

Methods Sensitivity Specificity AUC F2 score
TABLE I: The predictive performance of baselines and the proposed Patient2Vec framework

V-B2 Multi-layer perceptron (MLP)

A multi-layer perceptron is trained to predict hospitalization using the same inputs as logistic regression. Here, we use a one hidden layer MLP with  hidden nodes.

V-B3 Forward RNN with medical group embedding (FRNN-MGE)

We split the sequence into subsequences with equal interval . The input at each step is the counts of medical groups within the associated time interval, and the patient characteristics are appended as additional features in the final logistic regression step. Here, the RNN is a forward GRU (or LSTM [18]) with one hidden layer and the size of the hidden layer is .

V-B4 Bidirectional RNN with medical group embedding (BiRNN-MGE)

The inputs used for this baseline are the same as those for the FRNN-MGE [15]. The RNN used here is a bidirectional GRU with one hidden layer and the size of the hidden layer is .

V-B5 Forward RNN with medical vector embedding (FRNN-MVE)

We split the sequence into subsequences with equal interval . The input at each step is the vector representation of the medical codes within the associated time interval, and the patient characteristics are appended as additional features in the final logistic regression step. Here, the RNN is a forward GRU (or LSTM [28]) with one hidden layer and the size of the hidden layer is .

V-B6 Bidirectional RNN with medical vector embedding (BiRNN-MVE)

The inputs used for this baseline are the same as those for the FRNN-MVE [25]. The RNN used here is a bidirectional GRU or LSTM [15] with one hidden layer and the size of the hidden layer is .

V-B7 RETAIN

This model uses a reverse time attention mechanism on RNNs for an interpretable representation of a patient’s EHR data [31]. The inputs are the same as those for FRNN-MGE, which takes the counts of medical groupings within each time interval to construct features. Similarly, the two RNNs used for generating weights are GRU-based and the size of the hidden layers is .

V-B8 Patient2Vec

The inputs are the same as those for FRNN-MVE. One filter is used when generating weights for within-subsequence attention, and three filters are used for subsequence-level attention. Similarly, the RNN used here is GRU-based and there is one hidden layer and the size of the hidden layer is .

The inputs of all baselines and Patient2Vec are normalized to have zero mean and unit variance. We model the risk of hospitalization based on Patient2Vec and baseline representations of patients’ medical histories, and the model performance is evaluated with area under the curve (AUC), sensitivity, specificity, and F2 score. The validation set is used for parameter tuning and early stopping in the training process. Each experiment is repeated  times and we calculate the averages and standard deviations of the above metrics, respectively.
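For reference, the evaluation metrics named above can be computed as in the following sketch. It uses the standard definitions (F2 is the F-beta score with beta = 2, which weights sensitivity more heavily than precision). The function names are hypothetical, and the use of the sample standard deviation across repeated runs is an assumption, since the text does not say which estimator was used.

```python
import statistics


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    return tn / (tn + fp) if (tn + fp) else 0.0


def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score; beta=2 gives the F2 score."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. This pairwise form is quadratic in the number of
    samples and is meant for illustration only.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(values):
    """Mean and sample standard deviation over repeated experiments."""
    return statistics.mean(values), statistics.stdev(values)
```

A typical use would be to compute each metric once per repeated run and pass the resulting list to `summarize`.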

V-C Experimental Results

The predictive performance of Patient2Vec and the baselines is presented in Table I. The results shown here for the RNN-based models are based on time interval days to construct subsequences.

According to Table I, the RNN-based models generally achieve higher predictive performance in terms of sensitivity, AUC, and F2 score, except for the RNN models based on medical group embedding, which have lower sensitivity. Among all RNN-based approaches, those based on vector embedding outperform those based on medical group embedding in terms of sensitivity, AUC, and F2 score. The bidirectional RNN models generally have higher specificity but lower sensitivity than the forward RNN models, while their AUC and F2 scores are comparable to those of the forward models. Overall, the proposed Patient2Vec framework outperforms the baseline methods, especially in terms of sensitivity and F2 score.

Fig. 4: The heat map showing feature importance for Patient A

V-D Visualization & Interpretation

In addition to predictive performance, we interpret the learned representation by examining the relative importance of clinical events in a patient’s EHR data. Because the feature importance learned by Patient2Vec is personalized for each patient, we illustrate it with two example patients. The accompanying profile figures present these two individuals, Patient A and Patient B. To facilitate the interpretation, instead of using raw medical codes, we present the clinical groups from the AHRQ clinical classification software for diagnosis and procedure codes, as well as pharmaceutical groups for medications.

Index Clinical Groups
Diagnosis groups
1 Essential hypertension
2 Other connective tissue disease
3 Spondylosis; intervertebral disc disorders; other back problems
4 Other lower respiratory disease
5 Disorders of lipid metabolism
6 Other aftercare
7 Diabetes mellitus without complication
8 Screening and history of mental health and substance abuse codes
9 Other nervous system disorders
10 Other screening for suspected conditions (not mental disorders or infectious disease)
Procedure groups
1 Other OR therapeutic procedures on nose; mouth and pharynx
2 Suture of skin and subcutaneous tissue
3 Other therapeutic procedures on eyelids; conjunctiva; cornea
4 Laboratory - Chemistry and hematology
5 Other laboratory
6 Other OR therapeutic procedures of urinary tract
7 Other OR procedures on vessels other than head and neck
8 Therapeutic radiology for cancer treatment
Medication groups
1 Diagnostic Products
2 Analgesics-Narcotic
TABLE II: The top clinical groups with high weights in hospitalized patients

Figure: The profile of Patient A.

Figure: The profile of Patient B.

According to the profile, Patient A is a male patient who has a hospitalization history in the observation window and is admitted to the hospital seven months after the end of the observation window for congestive heart failure. The predicted risk is , while the risk decreases for female patients or patients without hospitalization history. It is also not surprising to observe an increased risk for older patients. The heat map in Figure 4 shows the relative importance of the medical events in this patient’s medical record at each time window, and the first row of the heat map presents the subsequence-level attention. A darker color indicates a stronger correlation between the clinical events and the outcome. Accordingly, we observe that the last subsequence, t4, is the most important with respect to hospitalization risk, followed by , , and in order of importance.

Among all the clinical events in the subsequence , we observe that the OR therapeutic procedures (nose, mouth, and pharynx), laboratory (chemistry and hematology), coronary atherosclerosis & other heart disease, cardiac dysrhythmias, and conduction disorders are the ones with the highest weights, while other events such as other connective tissue disease are less important in terms of future hospitalization risk. Additionally, some medications appear to be informative as well, including beta blockers, antihypertensives, anticonvulsants, and anticoagulants. In the first time window, the medical events with high weights are coronary atherosclerosis & other heart disease, gastrointestinal hemorrhage, deficiency and anemia, and other aftercare. In the next subsequence, the most important medical events are heart diseases and related procedures such as coronary atherosclerosis & other heart disease, cardiac dysrhythmias, conduction disorders, hypertension with complications, other OR heart procedures, and other OR therapeutic nervous system procedures. We also observe that the kidney disease related diagnoses and procedures appear to be important features. Throughout the observation window, coronary atherosclerosis & other heart disease, cardiac dysrhythmias, and conduction disorders consistently show high weights with respect to hospitalization risk, and the findings are consistent with medical literature.

Fig. 5: The heat map showing feature importance for Patient B

Patient B is a male patient without hospitalization in the observation window. This patient is hospitalized for occlusion of cerebral arteries approximately one year after the observation window, and the predicted risk is . For a similar patient who is  years older or with previous hospitalization history, the risk increases by  and , respectively, while there is a smaller risk of hospitalization for a female patient. To illustrate the medical events of Patient B, the heat map in Figure 5 depicts the relative importance of medical groups in the subsequences, as well as the subsequence-level weights for hospitalization risk. As before, a darker color indicates a stronger correlation between the clinical events and the outcome. Accordingly, we observe that the second subsequence appears to be the most important, while the last one is less predictive of future hospitalization. In fact, the medical events in the last time window are spondylosis, intervertebral disc disorders, other back problems and other bone disease & musculoskeletal deformities, and malaise and fatigue, which are not highly related to the cause of hospitalization of Patient B.
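The "similar patient" comparisons above (older, female, or with a prior hospitalization) can be read as counterfactual queries against the final classifier. The sketch below assumes, as with the baselines, that patient characteristics enter a logistic regression alongside the learned representation, so that changing one characteristic while holding everything else fixed shifts the logit. The feature names, coefficients, and bias are hypothetical placeholders, not values from the paper.

```python
import math


def predicted_risk(features, weights, bias):
    """Logistic-regression risk for a dict of named feature values."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def counterfactual_change(features, weights, bias, **changes):
    """Change in predicted risk when some features are replaced.

    Only the features named in `changes` are altered; all other features,
    including the learned representation score, are held fixed.
    """
    altered = dict(features, **changes)
    return predicted_risk(altered, weights, bias) - predicted_risk(features, weights, bias)


# Hypothetical coefficients and patient, for illustration only.
toy_weights = {"representation_score": 1.0, "age_decades": 0.2, "prior_hospitalization": 0.8}
toy_patient = {"representation_score": -1.0, "age_decades": 6.0, "prior_hospitalization": 0.0}
toy_delta = counterfactual_change(toy_patient, toy_weights, bias=-1.0, prior_hospitalization=1.0)
```

With a positive coefficient on `prior_hospitalization`, `toy_delta` is positive, which mirrors the direction of the effect described in the text; the magnitude here carries no meaning.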

Index Diagnosis Groups
In patients admitted for osteoarthritis
1 Osteoarthritis
2 Other connective tissue disease
3 Other non-traumatic joint disorders
4 Spondylosis; intervertebral disc disorders; other back problems
5 Other aftercare
In patients admitted for septicemia
1 Essential hypertension
2 Diabetes mellitus without complication
3 Disorders of lipid metabolism
4 Other lower respiratory disease
5 Other aftercare
In patients admitted for acute myocardial infarction
1 Coronary atherosclerosis and other heart disease
2 Medical examination/evaluation
3 Other screening for suspected conditions (not mental disorders or infectious disease)
4 Other lower respiratory disease
5 Disorders of lipid metabolism
In patients admitted for congestive heart failure
1 Congestive heart failure (nonhypertensive)
2 Coronary atherosclerosis and other heart disease
3 Cardiac dysrhythmias
4 Diabetes mellitus without complication
5 Other lower respiratory disease
In patients admitted for diabetes mellitus with complications
1 Diabetes mellitus with complications
2 Diabetes mellitus without complication
3 Other aftercare
4 Other nutritional; endocrine; and metabolic disorders
5 Fluid and electrolyte disorders
TABLE III: The top diagnosis groups with high weights in patients hospitalized for osteoarthritis, septicemia, acute myocardial infarction, congestive heart failure, and diabetes mellitus with complications, respectively

In the most predictive subsequence, , we observe that other OR heart procedures, genitourinary symptoms, spondylosis, intervertebral disc disorders, other back problems, therapeutic procedures on eyelid, conjunctiva, and cornea, and arterial blood gases have high attention weights. In the earliest time window, the most important medical events also include therapeutic procedures on eyelid, conjunctiva, and cornea, and arterial blood gases, while diabetes, hypertension, and diagnostic products show relatively high importance. Throughout the observation window, the medical events spondylosis, intervertebral disc disorders, other back problems, and therapeutic procedures on eyelid, conjunctiva, and cornea consistently have high attention weights. Here, diagnostic products is a medication class that includes barium sulfate, iohexol, gadopentetate dimeglumine, iodixanol, tuberculin purified protein derivative, regadenoson, acetone (urine), and so forth. These medications are primarily used for blood or urine testing, or as radiopaque contrast agents for x-rays or CT scans for diagnostic purposes.

Additionally, we attempt to interpret the learned representation and feature importance at the population level. In Table II, we present the top clinical groups with high weights among hospitalized patients in the test set.

According to Table II, the most predictive diagnosis groups for future hospitalization are chronic diseases, including essential hypertension, diabetes, lower respiratory disease, disorders of lipid metabolism, and musculoskeletal diseases such as other connective tissue disease and spondylosis, intervertebral disc disorders, other back problems. The most important procedures are OR therapeutic procedures and laboratory tests, such as the OR procedures on nose, mouth, and pharynx, vessels, urinary tract, eyelid, conjunctiva, and cornea. It is not surprising that diagnostic products show high weights, considering these medications are used in testing or examinations for diagnostic purposes.
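One way to obtain a population-level ranking like the one discussed above is to aggregate per-patient attention weights by clinical group. The text does not state how the weights were aggregated, so the sketch below is only an assumption: it averages each group's weight over the patients in whom the group appears and returns the top `k` groups. The function name and input format are hypothetical.

```python
from collections import defaultdict


def top_groups(patient_weights, k):
    """Rank clinical groups by their mean attention weight across patients.

    patient_weights: list of dicts, one per patient, mapping a clinical
    group name to that patient's attention weight for the group.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for weights in patient_weights:
        for group, weight in weights.items():
            totals[group] += weight
            counts[group] += 1
    means = {group: totals[group] / counts[group] for group in totals}
    return sorted(means, key=lambda g: (-means[g], g))[:k]
```

Restricting `patient_weights` to patients admitted for a single primary cause would give a cause-specific ranking under the same assumption.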

Moreover, we present the top diagnosis groups with high weights in patients hospitalized for different primary causes. Table III shows the top diagnosis groups with high weights in patients admitted for osteoarthritis, septicemia (except in labor), acute myocardial infarction, congestive heart failure (nonhypertensive), and diabetes mellitus with complications, respectively. Accordingly, we observe that the most important diagnoses for hospitalization risk prediction in the population admitted for osteoarthritis are musculoskeletal diseases such as connective tissue disease, joint disorders, and spondylosis. In contrast, the diagnoses with the highest weights in the patients admitted for septicemia are chronic diseases including essential hypertension, diabetes, disorders of lipid metabolism, and respiratory disease. The top diagnoses overlap between the populations admitted for acute myocardial infarction and for congestive heart failure, since both populations are admitted for heart diseases. Here, the overlapping diagnosis groups include coronary atherosclerosis and other heart diseases and lower respiratory diseases. As for patients admitted for diabetes with complications, the top diagnoses are diabetes with or without complications, nutritional, endocrine, and metabolic disorders, and fluid and electrolyte disorders. In general, the learned feature importance is consistent with medical literature.

VI Discussion

Our proposed framework is applied to the prediction of hospitalization using real EHR data, which demonstrates its prediction accuracy and interpretability. This work could be further enhanced by incorporating follow-up information on the negative patient population and investigating whether these patients indeed show an improved health outcome or are hospitalized elsewhere. Patient2Vec employs a hierarchical attention mechanism, allowing us to directly interpret the weights of clinical events. In future work, we will extend the attention to incorporate demographic information for a more comprehensive and automatic interpretation.

Although we apply Patient2Vec to the early detection of long-term hospitalization, i.e., at least 6 months after the previous hospitalization, it could be used to predict the risk of 30-day readmission to help prevent unnecessary rehospitalizations.

VII Conclusion

In this paper, we propose a representation learning framework, Patient2Vec, to learn a personalized interpretable deep representation of EHR data based on recurrent neural networks and the attention mechanism. This work improves the performance of predictive models and deepens the understanding of disease correlations. We apply this framework to the risk prediction of hospitalization using patients’ longitudinal EHR data. The experimental results demonstrate that the proposed Patient2Vec representation achieves more accurate predictions than the baseline approaches. Moreover, the learned feature importance in the representations is interpreted at both the individual and population levels to facilitate clinical insights.

In this work, the proposed Patient2Vec framework is evaluated with the risk prediction of all-cause hospitalization, but in the future could be applied to predict hospitalization in more specific populations, other health related prediction problems, or domains outside of health.