I. Introduction
With the rapid growth in the utilization of healthcare information systems over the last few decades, huge volumes of electronic health records (EHR) have been accumulated. A patient's EHR data typically consists of a sequence of visit records with irregular admission intervals, where each visit carries admission and discharge timestamps and a set of clinical events, such as diagnoses, procedures, and medications [1, 2]. Fig. 1 shows an EHR segment of a patient, which is referred to as the patient journey in this paper. Analyzing EHR data to benefit the care of large numbers of patients has attracted tremendous attention from both academia and industry. One of the numerous analytical tasks is to predict future diagnoses [3, 4, 2, 5, 6] based on a patient's historical EHR data. For example, [3] and [4] employ recurrent neural networks (RNN) to integrate medical ontology for capturing temporal visits and predicting sequential diagnoses. Ref. [2] predicts future diagnoses by utilizing co-occurrence statistics and multiple ontological representations via an attention mechanism on EHR data. Although the existing methods have achieved promising results, they are still challenged by the following two limitations.

The first major challenge is how to effectively model such irregular and temporal EHR data with admission and discharge states. Some recent works [7, 5, 8, 9, 10, 3, 11, 4, 12] directly adapt text representation learning algorithms [13] to the sequence-formatted EHR data. For example, Med2Vec [7]
learns a vector representation for each medical concept from the co-occurrence information without considering the temporal sequential nature of the EHR data. Further, to capture both long-term dependency and sequential information, recurrent neural networks [10, 3, 11, 4, 14], including LSTM [15] and GRU [16], have been used to learn contextualized representations of EHR data. However, despite the similarity between EHR data and natural language text, one major difference is that EHR data inherently includes timestamps. Namely, beyond dependency, there is a time interval between each pair of visits. For example, as shown in Fig. 1, the interval between Visit 1 and Visit 2 is shorter than that between Visit 2 and Visit 3, indicating that the dependency between Visit 1 and Visit 2 may be stronger than that between the latter pair. Due to the irregular visits of patients, this is very common in EHR data. Meanwhile, previous works consider only the admission states and ignore the discharge states; in other words, they rarely take the length of stay of each visit into account when modeling patient journeys.

The other limitation is that existing methods rely on a large volume of training data, which is generally not easily available due to both labour-intensive labeling costs and privacy concerns [17]
. Fortunately, some recent works in the natural language processing (NLP) field provide effective solutions when task-specific supervised data is scarce. One promising research direction is to leverage off-the-shelf relational knowledge [18, 19] (e.g., Freebase and DBpedia) to enhance the model, especially when the knowledge can serve as supportive evidence for the targeted task. Taking this inspiration, recent works [3, 4, 2] train medical code embeddings upon medical ontology by using a graph-based attention mechanism, and thus deliver competitive performance even with insufficient task-specific supervised data. Despite their success in several healthcare tasks, these methods still suffer from a main limitation: the rich dependency information underlying the patient's sequential EHR is rarely exploited during medical ontology learning. For example, the medical codes from visit information and the medical ontology are heterogeneous, and how to effectively learn both representations and fuse their heterogeneous features remains a challenging open task.

To cope with the aforementioned limitations, we propose an end-to-end robust transformer-based healthcare analytics model, named SEquential Diagnosis Prediction with Transformer and Ontological Representation (SETOR). SETOR integrates medical ontology to alleviate data insufficiency, and exploits neural ordinary differential equations (ODE) [20, 21] to tackle the temporal irregularity occurring both between consecutive visits and within the length of stay of each visit. Specifically, SETOR first employs an attention-based graph-embedding approach to learn ontological and generalized representations of medical codes, mitigating the problem of data insufficiency. Next, an ontological encoder is proposed to integrate the learned ontological representations into visit information to enhance the medical representations. Then, the proposed model utilizes neural ODEs to learn the discharge state based on the admission state of each visit, called the LoS (Length of Stay) ODE, and the hidden states for the irregular intervals between consecutive visits, called the Interval ODE. Lastly, SETOR integrates the learned discharge and interval hidden states with the compressed visit vectors via a patient journey transformer to predict sequential diagnoses.
Consequently, the proposed model can improve the prediction quality of future diagnoses, and advance the robustness irrespective of sufficient or insufficient data.
To summarize, our main contributions are:

novel LoS and Interval ODE representations, which use neural ODEs to model discharge states and capture the irregular admission-interval dependencies between a patient's visits;

an end-to-end neural network called SETOR that accurately predicts sequential diagnoses using neural ODEs and ontological representations;

an evaluation on two real-world datasets, qualitatively demonstrating the interpretability of the learned representations of medical codes and quantitatively validating the effectiveness of the proposed model.
II. Related Work
Deep neural networks have recently been applied to healthcare analytical tasks and have attracted enormous interest in healthcare informatics. This section reviews two types of related studies: sequential prediction on EHR data and medical ontologies on EHR data.
II-A Sequential Prediction on EHR
Sequential prediction of clinical events based on EHR data has attracted tremendous attention. Most existing models utilize RNNs and attention mechanisms to predict future diagnoses. Med2Vec [7] and MIME [22] indirectly exploit an RNN to embed visit sequences into a patient representation, using multi-level representation learning to integrate visits and medical codes based on visit sequences and the co-occurrence of medical codes to predict future health outcomes. Other research works have used RNNs directly to model time-ordered patient visits for predicting diagnoses [11, 3, 10, 14, 4, 23, 24, 25, 26]. For example, Dipole [10] and RETAIN [11] employ RNNs to model the sequential relationships among medical concepts, guided by the future diagnosis prediction task in an end-to-end learning manner. Attention-based models, such as MMORE [2] and MusaNet [5], have been employed to capture both visit dependencies and sequential information in the EHR data to predict future diagnoses. Most of these methods rarely make use of the discharge states and irregular intervals in the EHR data. Recently, neural ODEs [20, 21] have been proposed to handle arbitrary time gaps between observations, which provides an opportunity to alleviate the limitations of the existing models.
II-B Medical Ontologies on EHR
Though healthcare information systems have accumulated huge volumes of EHR data, labeled data is generally not easily available due to both the labour-intensive cost of labeling training data and the privacy concerns surrounding patient data. Facing the challenge of insufficient data, additional medical ontologies have been utilized to improve the quality of medical code embeddings and the predictive performance. For instance, GRAM [3] proposes a graph-based attention model to incorporate the medical ontology, combining an attention mechanism and recurrent neural networks for representation learning, with application to diagnosis prediction. KAME [4] extends the GRAM model to additionally consider side information from the learned embeddings of the non-leaf nodes in the medical ontology, and exploits an RNN to integrate the knowledge from the medical codes, the non-leaf nodes, and the EHR data to predict future diagnoses. MMORE [2] learns multiple ontological representations for the non-leaf nodes in the ontology and integrates the EHR co-occurrence statistics to predict sequential diagnoses. However, these models do not mutually integrate the medical codes and the ontology, leaving learning effective representations from both EHR data and the ontologies an open question.

III. Methodology
This section starts with the notations of several important concepts and the problem statement. The remainder focuses on the details of the proposed model, consisting of the patient journey transformer, the ontological and ODE representations, and the sequential diagnosis prediction task.
III-A Notations and Problem Statement
III-A1 Notations
We denote the set of medical codes from the EHR data as C = {c_1, c_2, ..., c_|C|}, where |C| is the number of unique medical codes. A patient's clinical records can be represented by a sequence of visits J = <V_1, V_2, ..., V_T>, which is referred to as the patient journey in this paper, where T is the number of visits in the patient journey. Each visit V_t consists of a subset of medical codes (V_t ⊆ C). For clarity, all algorithms are presented for a single patient journey. In addition, a medical ontology G contains the hierarchy of various medical concepts with parent-child semantic relationships. In particular, the medical ontology G is a directed acyclic graph (DAG), and the nodes of G consist of leaves and their ancestors, as shown in the left part of Fig. 2. Each leaf node refers to a medical code in C and is associated with a sequence of ancestors from the leaf to the root of G. Each ancestor node belongs to the set C' of ancestor codes, where |C'| is the number of ancestor codes in G. An ancestor node in the medical ontology represents a related but more general concept than its children. Including these semantic relationships helps the model improve the medical concept representations, which can lead to more accurate predictions of sequential diagnoses. Table I summarizes the notations used throughout the paper.
Notation  Description 

C  Set of unique medical codes in the dataset  
|C|  The number of unique medical codes  
c_i  c_i ∈ C, the i-th medical code in C  
V_t  The t-th visit of the patient, V_t ⊆ C  
J  The patient journey, J = <V_1, ..., V_T>  
G  The medical ontology, a directed acyclic graph  
C'  Set of ancestor codes in G  
O  Ontological embedding matrix  
E  Embedding matrix of medical codes  
e_i  Basic embedding vector of medical code c_i  
d  The dimension of medical code embeddings 
III-A2 Problem Statement
Given a time-ordered patient journey J = <V_1, ..., V_T> and the medical ontology G, the goal of the sequential diagnosis prediction problem is to predict the next visit's information. For the t-th visit, where 1 ≤ t ≤ T − 1, the output is the predicted set of medical codes of visit V_{t+1}.
III-B Model Overview
To make the best use of the irregular and temporal properties in EHR data and to alleviate the challenge of insufficient data, we propose a robust transformer-based model, called SETOR, illustrated in Fig. 2. First, the medical ontology G is embedded into an ontological embedding matrix O. Then, an ontological encoder aggregates both the embedded diagnoses from O and an initial embedding matrix E through an embedding operation to learn both co-occurrence and medical knowledge. E ∈ R^{|C|×d} is an embedding matrix of medical codes, where d represents the embedding size. E is randomly initialised from a uniform distribution, and its entries are learnable during model training in an end-to-end manner. The outputs of the ontological encoder are fed into an attention pooling layer to compress the set of medical codes in a visit into a single vector representation. Next, our proposed LoS and Interval ODE representations are added to the learned visit representations, and the normalized outputs are fed into a journey transformer to learn the visit dependencies in a patient journey. Lastly, a predictive model, following the journey transformer, is used to predict the next visit's information.
III-C Ontological Representation
To mitigate the problem of data insufficiency in healthcare and to learn knowledgeable and generalized representations of medical codes, we employ the attention-based graph representation approach of GRAM [3]. In the medical ontology G, each leaf node c_i has a basic learnable embedding vector e_i ∈ R^d, where 1 ≤ i ≤ |C| and d represents the dimensionality. Each non-leaf node c_j also has an embedding vector e_j ∈ R^d, where |C| + 1 ≤ j ≤ |C| + |C'|. E is initialized with values drawn from a uniform distribution. The attention-based graph embedding uses an attention mechanism to learn the d-dimensional final embedding g_i for each leaf node c_i (medical code) via:
g_i = Σ_{j ∈ A(i)} α_{ij} e_j    (1)
where A(i) denotes the set comprised of leaf node c_i and all its ancestors, e_j is the d-dimensional basic embedding of node c_j, and α_{ij} is the attention weight on embedding e_j when calculating g_i, which is formulated by the following Softmax function:
α_{ij} = exp(f(e_i, e_j)) / Σ_{k ∈ A(i)} exp(f(e_i, e_k))    (2)
f(e_i, e_j) = w_a^T tanh(W_a [e_i; e_j] + b_a)    (3)
where [e_i; e_j] concatenates e_i and e_j in the child-ancestor order; w_a, W_a and b_a are learnable parameters.
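To make the attention-based graph embedding concrete, the following is a minimal NumPy sketch of Eqs. (1)-(3); the toy ontology, tensor sizes, and parameter shapes are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, d, l = 6, 8, 16          # 4 leaves + 2 ancestors (hypothetical sizes)
E = rng.normal(size=(num_nodes, d)) # basic embeddings for leaves and ancestors
W_a = rng.normal(size=(l, 2 * d))   # attention MLP weights (Eq. 3)
b_a = np.zeros(l)
w_a = rng.normal(size=l)

# ancestors[i] lists leaf i itself plus its ancestors up to the root
ancestors = {0: [0, 4, 5], 1: [1, 4, 5], 2: [2, 5], 3: [3, 5]}

def ontological_embedding(i):
    """Compute g_i as the attention-weighted sum over leaf i and its ancestors."""
    nodes = ancestors[i]
    # compatibility score f(e_i, e_j) for each node j in A(i) (Eq. 3)
    scores = np.array([
        w_a @ np.tanh(W_a @ np.concatenate([E[i], E[j]]) + b_a)
        for j in nodes
    ])
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()             # softmax attention weights (Eq. 2)
    return alpha @ E[nodes]          # weighted sum of basic embeddings (Eq. 1)

G = np.stack([ontological_embedding(i) for i in range(4)])  # leaf-code embeddings
```

Each leaf code thus inherits information from its more general ancestors, which is what allows rare codes to borrow statistical strength from well-observed parent concepts.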
III-D Ontological Encoder
To encode both the visit information and the medical ontology, and to fuse their heterogeneous features, we propose an ontological encoder. The dotted rectangle in Fig. 2 shows the details of the encoder. We first calculate the code embeddings C_e and the node embeddings O_e, where C_e and O_e are 3-dimensional tensors obtained by embedding the medical codes in the patient journey according to E and O, respectively. Next, C_e and O_e are fed into two different multi-head self-attentions (MultiHead) [27], where n is the number of medical codes in each visit of the patient journey. For simplicity, we take the t-th visit V_t in the patient journey as an example to demonstrate the process of the ontological encoder as follows:

C̃_t = MultiHead(C_t),  Õ_t = MultiHead(O_t)    (4)

where MultiHead is the multi-head attention function [27], and C_t, O_t ∈ R^{n×d} are the code and node embeddings of visit V_t.
III-D1 Multi-Head Attention
The multi-head attention mechanism relies on self-attention, where all of the keys, values and queries come from the same place. The self-attention operates on a query Q, a key K and a value V:

Attention(Q, K, V) = softmax(QK^T / √d) V    (5)

where Q, K, V ∈ R^{n×d} are matrices, n denotes the number of medical codes in a visit of the patient journey, and d denotes the dimension of the embedding.
The multi-head attention mechanism obtains h (i.e., one per head) different representations of (Q, K, V), computes self-attention for each representation, and concatenates the results. This can be expressed as follows:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O    (6)

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (7)

where the projections W_i^Q, W_i^K, W_i^V ∈ R^{d×d/h} and W^O ∈ R^{d×d} are parameter matrices, and h is the number of attention heads.
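The scaled dot-product and multi-head computations of Eqs. (5)-(7) can be sketched as follows. For brevity, this sketch applies a single projection per role and then splits the result into heads, which is equivalent to using separate per-head W_i matrices; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 5, 16, 4                  # codes per visit, model dim, heads (illustrative)
X = rng.normal(size=(n, d))         # embedded medical codes of one visit
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wo = rng.normal(size=(d, d))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (Eq. 5)."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(X):
    """Project, split into h heads, attend per head, concatenate, project (Eqs. 6-7)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = d // h
    heads = [attention(Q[:, i*dk:(i+1)*dk], K[:, i*dk:(i+1)*dk], V[:, i*dk:(i+1)*dk])
             for i in range(h)]
    return np.concatenate(heads, axis=-1) @ Wo

out = multi_head(X)                 # contextualized code representations, shape (n, d)
```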
III-D2 Information Integration
The ontological encoder adopts an information integration layer for the mutual integration of the code and node embeddings in a visit. The process is as follows:

H_t = σ(C̃_t W_c + Õ_t W_o + b)    (8)

where W_c, W_o and b are learnable parameters, and H_t is the inner hidden state integrating the information of both the codes and the nodes. σ is a non-linear activation function, usually the ReLU function.
The output of the ontological encoder is denoted as follows:

Z_t = σ(H_t W_z + b_z)    (9)

where W_z and b_z are learnable parameters and Z_t ∈ R^{n×d}, so that we can represent the heterogeneous information of the medical codes and the ontology in a unified feature space.
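A minimal sketch of the information integration layer, assuming ReLU as the activation and illustrative sizes; the two input matrices stand in for the outputs of the two self-attention branches (codes and ontology nodes).

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 5, 16
C = rng.normal(size=(n, d))  # code representations from one self-attention branch
O = rng.normal(size=(n, d))  # ontology-node representations from the other branch
W_c, W_o, W = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
b, b2 = np.zeros(d), np.zeros(d)

relu = lambda z: np.maximum(z, 0.0)

H = relu(C @ W_c + O @ W_o + b)   # fused hidden state (Eq. 8)
out = relu(H @ W + b2)            # unified feature-space output (Eq. 9)
```

Because both branches are projected into the same hidden state before the output projection, each code's representation carries both its visit-level context and its ontological context.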
For the ontological representation and encoder, we first build an embedding matrix for both the leaf and non-leaf nodes in the medical ontology. Then, we extract knowledge for each code from the medical ontology as a tuple consisting of the code and its ancestors, and embed each code in the tuple by looking up the embedding matrix. Lastly, we calculate the knowledge-enriched representation of each leaf node using Eq. (1). During this procedure, we also interactively encode both the visit records and the medical ontology with the presented ontological encoder, fusing their heterogeneous features.
III-E Attention Pooling
Attention pooling [28, 29] explores the importance of each individual code within a visit. It works by compressing a set of medical code embeddings from a visit into a single context-aware vector representation. For simplicity, we take the t-th encoder output Z_t as an example. Formally, it is written as:

u_i = w_2^T ReLU(W_1 z_i + b_1)    (10)

where z_i is the i-th row of Z_t (1 ≤ i ≤ n), ReLU is the rectified linear unit, and W_1, w_2 and b_1 are learnable parameters. The probability distribution is formalized as

a_i = exp(u_i) / Σ_{j=1}^{n} exp(u_j)    (11)

The final output of the attention pooling is the weighted average of sampling a code according to its importance, i.e.,

v_t = Σ_{i=1}^{n} a_i z_i    (12)

where v_t represents the t-th visit in the patient journey.
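The attention pooling step can be sketched as follows, with hypothetical projection sizes: each code is scored, the scores are normalized with a softmax, and the code representations are averaged under those weights.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, da = 6, 16, 32
Z = rng.normal(size=(n, d))        # encoder outputs for the codes of one visit
W1 = rng.normal(size=(d, da))      # hypothetical projection sizes
w2 = rng.normal(size=da)

scores = np.maximum(Z @ W1, 0.0) @ w2   # per-code importance u_i (Eq. 10)
a = np.exp(scores - scores.max())
a /= a.sum()                            # softmax distribution a_i (Eq. 11)
v = a @ Z                               # visit vector: weighted average (Eq. 12)
```

The weights a_i also offer a simple form of interpretability, since they indicate which codes dominate the visit representation.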
III-F ODE Representations
Neural ODEs [20, 21, 30] model a time series as a continuously changing trajectory, making better use of the data's timestamp information and enabling predictions at arbitrary times. Each trajectory is determined by a local initial state and a global set of latent dynamics shared by all time series. Given observation times t_0, t_1, ..., t_N and an initial state z_{t_0}, an ODE solver generates z_{t_1}, ..., z_{t_N}, which describe the underlying state at each observation. This generative model can be defined by the following formulas:

dz(t)/dt = f(z(t), θ_f)    (13)

z_{t_1}, ..., z_{t_N} = ODESolve(z_{t_0}, f, θ_f, (t_1, ..., t_N))    (14)

where f is a time-invariant function realized by a neural network with parameters θ_f. The function f takes the value z at the present time step and outputs its gradient.

A neural ODE thus defines the dynamics in Eq. (13). The equation needs to be solved during each evaluation, beginning with the initial state z_{t_0}; this is known as an initial value problem. Pontryagin's adjoint method is employed to calculate the gradients of the ODE. With a low memory footprint, this method works by solving a second, augmented ODE backwards in time, and can be used with all ODE integrators. By solving the equation, the desired sequence of hidden states can be produced for the downstream modules.
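As a rough illustration of how an ODE solver produces hidden states at irregular observation times, the sketch below uses simple fixed-step Euler integration in place of an adaptive solver with adjoint-based gradients; the dynamics network and step counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 8
W = rng.normal(size=(d, d)) * 0.1  # parameters theta_f of the dynamics network

def f(z):
    """Time-invariant dynamics dz/dt = f(z; theta_f): a one-layer net (Eq. 13)."""
    return np.tanh(z @ W)

def ode_solve(z0, times, steps_per_unit=20):
    """Euler integration from times[0] to each later time, like Eq. (14)."""
    states, z, t = [z0], z0.copy(), times[0]
    for t_next in times[1:]:
        n_steps = max(1, int((t_next - t) * steps_per_unit))
        h = (t_next - t) / n_steps
        for _ in range(n_steps):
            z = z + h * f(z)       # Euler update: follow the gradient for step h
        states.append(z.copy())
        t = t_next
    return states

z0 = rng.normal(size=d)
hidden = ode_solve(z0, [0.0, 0.5, 2.0, 3.5])   # irregular observation times
```

Note that the solver naturally handles arbitrary gaps between observations, since the integration horizon between two states is simply the elapsed time; this is precisely what makes the formulation attractive for irregular visit intervals.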
LoS ODE Representation. As each visit has two timestamps (admission and discharge times) and each initial state (e.g., the t-th visit vector v_t) is given by the output of the attention pooling, we can utilize a neural ODE to predict the discharge state as follows:

h_t^{dis} = ODESolve(v_t, f, θ_f, (t_t^{adm}, t_t^{dis}))    (15)

where h_t^{dis} is the discharge state of the t-th visit, and t_t^{adm} and t_t^{dis} are the admission and discharge timestamps, respectively.
Interval ODE Representation. A patient journey consists of a sequence of irregular visits with timestamps. We can utilize a neural ODE to learn the hidden state for each visit timestamp following Eq. (14):

h_{t_1}, ..., h_{t_T} = ODESolve(h_{t_1}, f, θ_f, (t_1, ..., t_T))    (16)

h_{t_1} ∼ p(h_{t_1})    (17)

where p is a probability distribution dependent on time, and h_{t_1} is the state of the first admitted visit of a patient journey. The outputs of the LoS and Interval ODE representations are added to the outputs of the attention pooling and normalized for the following layer, which is denoted as follows:

Ṽ_t = LayerNorm(v_t + h_t^{dis} + h_t)    (18)
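Combining the pooled visit vectors with the two ODE states, as in Eq. (18), can be sketched as below, assuming standard layer normalization without learnable gain and bias; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
T, d = 3, 8
v = rng.normal(size=(T, d))        # attention-pooled visit vectors
h_dis = rng.normal(size=(T, d))    # LoS ODE discharge states (Eq. 15)
h_int = rng.normal(size=(T, d))    # Interval ODE states (Eq. 16)

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

v_tilde = layer_norm(v + h_dis + h_int)   # normalized visit inputs (Eq. 18)
```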
III-G Patient Journey Transformer Module
To learn the relationships between visits in a patient journey, a module called J-Transformer is proposed to capture the inherent dependencies, as shown in Fig. 2. J-Transformer is responsible for learning the dependencies of sequential visits in a patient journey with admission intervals and the length of stay of each visit, which is calculated as follows:

O^J = J-Transformer(Ṽ)    (19)

We denote the number of J-Transformer layers as L. Ṽ = <Ṽ_1, ..., Ṽ_T> is the input to J-Transformer, which is depicted in Fig. 3. J-Transformer is identical to its implementation in BERT [27] and [31], which has two sub-layers. The first is the multi-head attention mechanism mentioned in Section III-D1, and the second is a position-wise fully connected feed-forward network. A residual connection [32] is employed around each of the two sub-layers, followed by layer normalization [33].

III-H Sequential Diagnoses Prediction
Given a patient's visit records <V_1, ..., V_T>, to capture the sequential visit information in the EHR, we perform the sequential diagnosis prediction task with the objective of predicting the disease codes of the next visit V_{t+1}, which can be expressed as follows:

ŷ_t = Softmax(W_s o_t + b_s)    (20)

L = −(1/(T−1)) Σ_{t=1}^{T−1} ( y_t^T log ŷ_t + (1 − y_t)^T log(1 − ŷ_t) )    (21)

where o_t is the output of J-Transformer denoting the representation of the t-th visit, L is the loss function, y_t is a vector whose i-th element is 1 if the i-th diagnosis code exists in V_{t+1} and 0 otherwise, and W_s and b_s are the learnable parameters.

IV. Experiments
In this section, we conduct experiments on two real-world medical claim datasets to evaluate the performance of the proposed SETOR. Compared with state-of-the-art predictive models, SETOR yields better performance under different evaluation strategies.
IV-A Data Description
We conducted comparative studies on two real-world datasets, the MIMIC-III [34] and MIMIC-IV [35] databases.
MIII Dataset
The MIMIC-III dataset [34] is an open-source, de-identified dataset of ICU patients and their EHRs between 2001 and 2012. The diagnosis codes in the dataset follow the ICD-9 standard. MIMIC-III is denoted by MIII in the experiments.
MIV Dataset
The MIMIC-IV dataset [35] is an update to MIMIC-III, which incorporates contemporary data and improves on numerous aspects of MIMIC-III. The dataset consists of the medical records of 73,452 patients between 2008 and 2019. MIMIC-IV is denoted by MIV in the experiments.
Tab. II shows the statistical details of the datasets, where the selected patients made at least two visits.
Dataset  MIII  MIV 
# of patients  7,499  73,452 
# of visits  19,911  295,351 
Avg. # of visits per patient  2.66  4.02 
# of unique ICD-9 codes  4,880  9,165 
Avg. # of ICD-9 codes per visit  13.06  12.01 
Max # of ICD-9 codes per visit  39  57 
# of category codes  272  283 
Avg. # of cat. codes per visit  11.23  10.41 
Max # of cat. codes per visit  34  37 
IV-B Experimental Setup
In this subsection, we first introduce the state-of-the-art approaches for the diagnosis prediction task in healthcare, then outline the measures used for predictive performance evaluation, and finally describe the implementation details.
Baseline Approaches
We compare the performance of our proposed model against the following stateoftheart baseline models:

RETAIN [11], which learns the medical concept embeddings and performs heart failure prediction via a reversed RNN with an attention mechanism.

Dipole [10], which uses a bidirectional RNN and three attention mechanisms (location-based, general, concatenation-based) to predict patient visit information. We chose location-based Dipole as a baseline method.

GRAM [3], which is a graph-based attention model that learns representations from the knowledge graph to predict future medical outcomes.

KAME [4], which is a diagnosis prediction model inspired by GRAM, using a medical ontology to learn representations of medical codes and their parent codes. These are then used to learn input representations of patient data, which are fed into a neural network architecture to predict sequential diagnoses.

MMORE [2], which is based on a medical ontology with an attention mechanism and EHR co-occurrence statistics.
Predictive Task
The purpose of the sequential diagnosis prediction task is to predict the diagnosis information of the next visit. In the experiments, the true labels are prepared by grouping the ICD-9 codes into 283 groups using the CCS single-level diagnosis grouper (https://www.hcup-us.ahrq.gov/toolssoftware/ccs/AppendixASingleDX.txt). This improves the training speed and predictive performance while preserving sufficient granularity for all the diagnoses. We measure the predictive performance by Accuracy@k, which is defined as follows.

Sequential diagnosis prediction is a multi-label problem, so plain accuracy is inapplicable. Following previous works, we use Accuracy@k as the metric, which measures the ratio of positive labels ranked in the top k according to their logits. Specifically, given a test sample, we first calculate the logits for all categories with a trained model and rank the categories by their logits in descending order. Then, we count how many positive labels fall into the top k and compute the ratio over the number of all positive labels in this sample. Lastly, we derive the final metric, Accuracy@k, by averaging the ratios over the samples of the entire test set.

Implementation Details
We use the CCS multi-level diagnosis hierarchy as the medical ontology. We implement all the approaches with PyTorch 1.4.0. For training the models, we use Adadelta [36] with a mini-batch of 32 patients. We randomly split the data into a training set, a validation set and a test set, and fix the size of the validation set to 10%. To validate the robustness against insufficient data, we vary the size of the training set from 20% to 80% and use the remaining part as the test set. The validation set is used to determine the best parameter values over 100 training iterations. Dropout strategies (with a dropout rate of 0.1) are used for all the approaches. We set the same embedding dimension d for all the baselines and the proposed model.

IV-C Results of Sequential Diagnosis Prediction
Tab. III shows the Accuracy@k of SETOR and the baselines with different k on the two real-world datasets for the sequential diagnosis prediction task. From Tab. III, we can observe that the performance of the proposed SETOR is better than that of all the baselines on both datasets.
On the MIII dataset, compared with the best baseline MMORE, the accuracy of SETOR improves by 2.21% when k = 5. These results suggest that adding the LoS and Interval ODE representation layers when predicting diagnoses is effective. We observe that the performance obtained by the models using ontologies is better than that obtained by the models without ontologies on MIII, which can be regarded as a small and insufficient dataset. The underlying reason is that models integrating external medical ontologies can alleviate the issue of insufficient data.
On the MIV dataset, the proposed SETOR still outperforms all the state-of-the-art diagnosis prediction approaches. Compared with the best baseline RETAIN, the accuracy of SETOR improves by 8.77% when k = 5. We also find that RETAIN, which does not use ontologies, outperforms the other baseline models on MIV, the larger dataset. This implies that a model can obtain competitive performance without using ontologies when the size of the training dataset is larger. Although Dipole fuses location attention, its performance is inferior to that of the attention-based models. Overall, our proposed framework exhibits better predictive power on both sufficient and insufficient datasets.
On both datasets, the results show that the proposed model outperforms all the baselines, especially when the dataset is large. This demonstrates that the superiority of SETOR results from the explicit consideration of both the ontologies and the EHR co-occurrence, with the irregular intervals and discharge states being well handled.
Dataset  Model  Accuracy@k (%)  

5  10  20  30  
RETAIN  27.15  41.41  57.68  68.25  
Dipole  24.55  37.04  54.01  60.09  
MIII  GRAM  27.72  41.24  58.05  68.08 
KAME  27.98  41.81  57.31  68.02  
MMORE  28.97  43.74  61.10  71.61  
SETOR  31.18  45.80  62.36  72.46  
RETAIN  38.95  57.60  73.91  81.48  
Dipole  32.48  47.75  63.45  72.39  
MIV  GRAM  33.63  48.84  64.34  73.05 
KAME  33.56  48.80  63.94  72.64  
MMORE  34.21  50.59  66.53  75.21  
SETOR  47.72  71.21  86.26  90.32 
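The Accuracy@k metric described in Section IV-B can be sketched as follows; the toy logits and labels are illustrative.

```python
import numpy as np

def accuracy_at_k(logits, labels, k):
    """Ratio of true labels ranked in the top-k logits, averaged over samples."""
    ratios = []
    for logit, label in zip(logits, labels):
        topk = np.argsort(logit)[::-1][:k]     # categories ranked by logit, top k kept
        positives = np.flatnonzero(label)      # true diagnosis groups for this sample
        hits = len(set(topk) & set(positives))
        ratios.append(hits / len(positives))   # fraction of positives recovered
    return float(np.mean(ratios))

# toy example: 2 samples, 6 grouped diagnosis categories
logits = np.array([[0.9, 0.1, 0.8, 0.2, 0.0, 0.3],
                   [0.2, 0.7, 0.1, 0.6, 0.5, 0.0]])
labels = np.array([[1, 0, 1, 0, 0, 0],
                   [0, 1, 0, 0, 1, 0]])
acc = accuracy_at_k(logits, labels, k=2)   # sample ratios 1.0 and 0.5, mean 0.75
```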
IV-D Data Sufficiency Analysis
To analyze the influence of data sufficiency on the predictions, we randomly split the data into training, validation, and test sets, and fix the size of the validation set to 10%. To validate the robustness against insufficient data, we vary the size of the training set to form four groups: 20%, 40%, 60%, and 80%, and use the remaining part as the test set. The training set in the 20% group is the most insufficient for training the proposed and baseline models, while the data in the 80% group is the most sufficient. Fig. 4 shows the Accuracy@20 on MIII and MIV.
From Fig. 4(a), we can observe that the accuracy of the proposed model is higher than that of the baselines in all groups. Specifically, MMORE is a competitive model with ontological representation and diagnosis co-occurrence on MIII, which shows that models integrating a medical ontology learn reasonable medical code embeddings that improve prediction with insufficient data. The performance of Dipole is inferior to the other baseline models, which indicates that taking only the sequential information into account is not enough.
When training on MIV, a sufficient dataset, Fig. 4(b) shows that the proposed model significantly outperforms all the baselines. We observe that the performance obtained by RETAIN, which does not use medical ontologies, is better than that of the other baselines on MIV. The underlying reason may be that next-admission diagnosis prediction is more sensitive to the diagnosis co-occurrence and the sequential positions of the visits when data is sufficient. Overall, the results demonstrate that the proposed model balances medical ontology and diagnosis co-occurrence over both insufficient and sufficient EHR data to further improve prediction performance.
IV-E Ablation Study
We performed a detailed ablation study to examine the contributions of the model's components to the prediction task. There are three components: (Transformer) the transformer blocks that learn the patient journey from the embedded visits; (Ontology) the external medical ontology integrated into SETOR; and (ODE Representations) the LoS and interval encodings added to the learned visit embeddings.

w/o JTrans: remove the patient journey transformer blocks from the proposed model;

w/o Ontology: remove the ontological representation from the proposed model;

w/o LoS: remove LoS ODE representation;

w/o Interval: remove the interval ODE representation;

w/o ODE: replace the two ODE representations with position embeddings.
Ablated Transformers
J-Transformer is responsible for learning the dependencies of sequential visits in a patient journey with admission intervals and the length of stay of each visit. We conducted a group of experiments to analyze the contribution of this component to SETOR over the two datasets. From Tab. IV, we observe that the full SETOR achieved superior accuracy to the ablated models. Specifically, we note that J-Transformer (w/o JTrans) contributes the highest accuracy gain to the predictive task on MIII: the accuracy improves by 4.14% and 4.89% when k = 5 and 20, respectively. On the MIV dataset, the performance of sequential diagnosis prediction is improved even further by the J-Transformer component: the accuracy increases by 9.86% and 15.49% when k = 5 and 20, respectively. The underlying reason is that the transformer blocks provide a significant number of additional parameters, and thus capacity, so the prediction performance is significantly improved.
Ablated Representations
In the paper, the representations consist of the ontological, LoS and Interval representations. From Tab. IV, we see that the various representation components all contribute to the proposed model SETOR, though their contributions are smaller than that of J-Transformer. Specifically, we observe that the ontological representation (w/o Ontology) contributes the highest accuracy gain to the predictive task on MIII, which gives us confidence in using external medical ontologies to enhance the patient journey representations when data is insufficient. Moreover, it is clear that the ODE representations provide valuable information for sequential diagnosis prediction on MIV, which implies that the irregular intervals and discharge states play more important roles with sufficient data. As shown in the ablation study (Tab. IV), the ontological and ODE representations contribute to the predictive tasks whether the training data is sufficient or not, e.g., a 1.18% lift of Acc@20 on MIV and a 0.56% lift of Acc@20 on MIII. Also, as shown in Fig. 5, the embeddings produced by our proposed model show clear separability of the disease categories.
Ablation  MIII (%)  MIV (%)  

Acc@5  Acc@20  Acc@5  Acc@20  
SETOR  31.18  62.36  47.72  86.26 
w/o JTrans  27.04  57.47  37.86  70.77 
w/o Ontology  30.39  61.80  47.65  86.07 
w/o LoS  30.74  61.96  47.59  86.20 
w/o Interval  30.95  61.84  47.57  86.18 
w/o ODE  30.45  61.93  47.38  85.08 
IV-F Interpretable Representation Analysis
To qualitatively demonstrate the interpretability of the medical code embeddings learned by all the predictive models on the MIII dataset, we randomly select 2,000 medical codes and plot them in a 2-D space with t-SNE [37], as shown in Fig. 5 and Fig. 6. Each dot represents a diagnosis code; the color of the dots represents the 18 disease categories in the CCS multi-level hierarchy (https://www.hcup-us.ahrq.gov/toolssoftware/ccs/AppendixCMultiDX.txt), and the text annotations represent the detailed disease categories. Ideally, the dots with the same color should form a single cluster, with margins among different clusters.
From Fig. 5, we can observe that SETOR learns interpretable disease representations that accord with the hierarchies of the given medical ontology, and obtains 18 non-overlapping clusters. Specifically, for the category "Residual codes; unclassified; all E codes [259. and 260.]", the medical codes are closely clustered together, with a large margin to the other categories. The embedding results for this category are in harmony with the CCS multi-level hierarchy, as the category has no subcategory in the CCS ontology. In contrast, the medical codes in the category "Injury and poisoning" are scattered over a larger area; the underlying reason is that this category has 12 subcategories. This demonstrates that our proposed model SETOR learns meaningful and semantic representations of medical codes, which have practical interpretability.
As shown in Fig. 6, MMORE, KAME, and GRAM are the baseline models most comparable to SETOR, as they also integrate medical ontology to predict sequential diagnoses. We observe that these three baselines learn reasonably interpretable diagnosis representations for several categories, although a large number of dots overlap in the central parts of Figs. 6b, 6c, and 6d. The medical codes in the category "Residual codes; unclassified; all E codes [259. and 260.]" are well clustered, but the learned embeddings for the category "Injury and poisoning" have no clear margins to the other categories. Figs. 6e and 6f suggest that models that do not use medical ontologies cannot easily learn interpretable representations. In addition, the predictive performance of SETOR is much better than that of MMORE, KAME, and GRAM (Tab. III), which shows that the proposed model improves prediction accuracy without sacrificing the interpretability of medical codes.
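The visual separability argument above could also be quantified, for instance with a silhouette score over the projected embeddings; this analysis is not performed in the paper and the data below are synthetic, purely to illustrate how margin quality maps to the score.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# two synthetic "category" clusters standing in for 2-D projected code embeddings
well_separated = np.vstack([rng.normal(0, 0.3, (100, 2)),
                            rng.normal(5, 0.3, (100, 2))])
overlapping    = np.vstack([rng.normal(0, 2.0, (100, 2)),
                            rng.normal(1, 2.0, (100, 2))])
labels = np.array([0] * 100 + [1] * 100)

print(silhouette_score(well_separated, labels))  # close to 1: clear margins
print(silhouette_score(overlapping, labels))     # near 0: clusters overlap
```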
V Conclusion
Although recent approaches have achieved promising performance on the sequential diagnosis prediction task, they still face two major challenges: how to effectively model the irregular and temporal properties of EHR data, and data insufficiency in healthcare information systems. In this paper, we propose an end-to-end transformer-based model, SETOR, which integrates medical ontology with visit information to mitigate the problem of data insufficiency, and utilizes neural ODE representations to learn hidden states for irregular intervals and visit discharges, thereby effectively capturing the irregular and temporal dependencies in EHR data. Although the proposed approach focuses on electronic health records in the healthcare domain, its core modules, e.g., the ODE representations and ontological encoding, can be easily adapted to many other domains, such as irregularly-sampled time series. Experiments show that SETOR outperforms the baselines with both sufficient and insufficient data, and the visualized representations of medical codes illustrate the interpretability of the proposed model. The experimental results on the two real-world medical datasets demonstrate the effectiveness, robustness, and interpretability of the proposed model.
Acknowledgment
This work was supported in part by the Australian Research Council (ARC) under Grant LP180100654 and DE190100626.
References
[1] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, "Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis," IEEE J Biomed Health Inform, vol. 22, no. 5, pp. 1589–1604, 2018.
[2] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. Fung, and J. Poon, "Medical concept embedding with multiple ontological representations," in IJCAI, 2019, pp. 4613–4619.
[3] E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun, "GRAM: Graph-based attention model for healthcare representation learning," in SIGKDD. ACM, 2017, pp. 787–795.
[4] F. Ma, Q. You, H. Xiao, R. Chitta, J. Zhou, and J. Gao, "KAME: Knowledge-based attention model for diagnosis prediction in healthcare," in CIKM. ACM, Oct. 2018, pp. 743–752.
[5] X. Peng, G. Long, T. Shen, S. Wang, and J. Jiang, "Self-attention enhanced patient journey understanding in healthcare system," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2020, pp. 719–735.
[6] W. Chen, S. Wang, G. Long, L. Yao, Q. Z. Sheng, and X. Li, "Dynamic illness severity prediction via multi-task RNNs for intensive care unit," in ICDM. IEEE, 2018, pp. 917–922.
[7] E. Choi, M. T. Bahadori, E. Searles, C. Coffey, M. Thompson, J. Bost, J. Tejedor-Sojo, and J. Sun, "Multi-layer representation learning for medical concepts," in SIGKDD. ACM, 2016, pp. 1495–1504.
[8] X. Zhang, B. Qian, X. Li, J. Wei, Y. Zheng, L. Song, and Q. Zheng, "An interpretable fast model for predicting the risk of heart failure," in Proceedings of the 2019 SIAM International Conference on Data Mining. SIAM, 2019, pp. 576–584.
[9] X. Peng, G. Long, T. Shen, S. Wang, J. Jiang, and C. Zhang, "BiteNet: Bidirectional temporal encoder network to predict medical outcomes," in ICDM. IEEE, 2020, pp. 412–421.
[10] F. Ma, R. Chitta, J. Zhou, Q. You, T. Sun, and J. Gao, "Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks," in SIGKDD. ACM, Aug. 2017, pp. 1903–1911.
[11] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart, "RETAIN: An interpretable predictive model for healthcare using reverse time attention mechanism," in NeurIPS, 2016, pp. 3504–3512.
[12] K. Jha, Y. Wang, G. Xun, and A. Zhang, "Interpretable word embeddings for medical domain," in ICDM. IEEE, 2018, pp. 1061–1066.
[13] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv:1301.3781, 2013.
[14] Z. Qiao, S. Zhao, C. Xiao, X. Li, Y. Qin, and F. Wang, "Pairwise-ranking based collaborative recurrent neural networks for clinical event prediction," in IJCAI, 2018.
[15] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[16] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using RNN encoder-decoder for statistical machine translation," arXiv:1406.1078, 2014.
[17] G. Long, T. Shen, Y. Tan, L. Gerrard, A. Clarke, and J. Jiang, "Federated learning for privacy-preserving open innovation future on digital health," arXiv:2108.10761, 2021.
[18] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, "ERNIE: Enhanced language representation with informative entities," arXiv:1905.07129, 2019.
[19] W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang, "K-BERT: Enabling language representation with knowledge graph," in AAAI, 2020, pp. 2901–2908.
[20] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, "Neural ordinary differential equations," in NeurIPS, 2018, pp. 6571–6583.
[21] Y. Rubanova, R. T. Chen, and D. K. Duvenaud, "Latent ordinary differential equations for irregularly-sampled time series," in NeurIPS, 2019, pp. 5320–5330.
[22] E. Choi, C. Xiao, W. Stewart, and J. Sun, "MiME: Multilevel medical embedding of electronic health records for predictive healthcare," in NeurIPS, 2018, pp. 4547–4557.
[23] F. Ma, J. Gao, Q. Suo, Q. You, J. Zhou, and A. Zhang, "Risk prediction on electronic health records with prior medical knowledge," in SIGKDD. ACM, Jul. 2018, pp. 1910–1919.
[24] I. M. Baytas, C. Xiao, X. Zhang, F. Wang, A. K. Jain, and J. Zhou, "Patient subtyping via time-aware LSTM networks," in SIGKDD, 2017, pp. 65–74.
[25] X. Peng, G. Long, S. Pan, J. Jiang, and Z. Niu, "Attentive dual embedding for understanding medical concepts in electronic health records," in IJCNN. IEEE, 2019, pp. 1–8.
[26] X. Peng, G. Long, T. Shen, S. Wang, J. Jiang, and M. Blumenstein, "Temporal self-attention network for medical concept embedding," in ICDM. IEEE, 2019, pp. 498–507.
[27] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017, pp. 5998–6008.
[28] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, "A structured self-attentive sentence embedding," arXiv:1703.03130, 2017.
[29] X. Cai, J. Gao, K. Y. Ngiam, B. C. Ooi, Y. Zhang, and X. Yuan, "Medical concept embedding with time-aware attention," in IJCAI, 2018, pp. 3984–3990.
[30] E. Dupont, A. Doucet, and Y. W. Teh, "Augmented neural ODEs," in NeurIPS, 2019, pp. 3140–3150.
[31] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018.
[32] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in CVPR, 2016, pp. 770–778.
[33] J. L. Ba, J. R. Kiros, and G. E. Hinton, "Layer normalization," arXiv:1607.06450, 2016.
[34] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, "MIMIC-III, a freely accessible critical care database," Scientific Data, vol. 3, p. 160035, 2016.
[35] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, and R. Mark, "MIMIC-IV (version 0.4)," PhysioNet, 2020.
[36] M. D. Zeiler, "ADADELTA: An adaptive learning rate method," arXiv:1212.5701, 2012.
[37] L. van der Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.