UniHPF : Universal Healthcare Predictive Framework with Zero Domain Knowledge

Despite the abundance of Electronic Healthcare Records (EHR), its heterogeneity restricts the utilization of medical data in building predictive models. To address this challenge, we propose Universal Healthcare Predictive Framework (UniHPF), which requires no medical domain knowledge and minimal pre-processing for multiple prediction tasks. Experimental results demonstrate that UniHPF is capable of building large-scale EHR models that can process any form of medical data from distinct EHR systems. Our framework significantly outperforms baseline models in multi-source learning tasks, including transfer and pooled learning, while also showing comparable results when trained on a single medical dataset. To empirically demonstrate the efficacy of our work, we conducted extensive experiments using various datasets, model structures, and tasks. We believe that our findings can provide helpful insights for further research on the multi-source learning of EHRs.





1 Introduction

Figure 1: An illustration of the conventional EHR system-specific framework. A typical EHR predictive model framework involves domain-knowledge-based pre-processing for each medical center’s schema, which requires schema-specific and code system-specific feature engineering. This results in input features being dependent on each hospital and incompatible with models among different hospitals.

Patient medical records are accumulated regularly in the form of Electronic Health Records (EHR), enabling quality treatment based on patients’ medical history. The abundance of EHR has opened the possibility of developing data-driven models, which aim to increase the quality of medical predictions.

However, typical EHR datasets do not follow a single data format, since each hospital stores EHR data according to its own needs. Specifically, different EHR systems adopt different medical code standards (e.g., ICD-9, ICD-10, raw text), and use distinct database schemas to store patient records [13, 12, 19].

Such heterogeneity is problematic because it acts as a barrier towards EHR model development.

In particular, when using patient clinical data, each hospital must employ its own data experts to rigorously pre-process EHR. Figure 1 shows a typical framework for EHR-system-driven predictive models. In addition, discrepancies in medical codes and schemas prevent multiple healthcare organizations from conducting multi-source learning, such as further training a model that has been previously trained on a distinct EHR database (i.e., transfer learning) or developing a model with EHR data pooled from multiple hospitals (i.e., pooled learning). Thus, hospitals cannot fully utilize the abundance of EHR data collected by multiple healthcare institutions. In order to resolve this issue of heterogeneity, a unified framework is required.

Previous studies have attempted to overcome this dissimilarity in several ways. For instance, Rajkomar et al. [20] used FHIR [16], a type of Common Data Model (CDM) to manually standardize distinct EHR data into a single format. AutoMap [26] learned to apply direct code mapping between different EHR systems in a self-supervised manner. In addition, DescEmb [11] aimed to overcome the heterogeneity of medical codes by utilizing clinical descriptions linked to each code, partially enabling multi-source learning. Despite their progress, these approaches only provide a partial solution to EHR heterogeneity because they still necessitate EHR system-specific healthcare expertise such as selecting meaningful features for each prediction task, which is often expensive, time-consuming, and not always optimal.

To address these challenges, we propose the Universal Healthcare Predictive Framework (UniHPF). Our framework presents a method for embedding any form of EHR system for prediction tasks without requiring domain-knowledge-based pre-processing, such as medical code mapping and feature selection. As such, UniHPF enables the construction of a large-scale healthcare model by utilizing massive amounts of data collected from multiple hospitals. The main contributions of this study can be summarized as follows:


  • We propose UniHPF, a neural-network-based universal framework that facilitates the learning of any EHR data without relying on domain knowledge. To the best of our knowledge, this is the first approach that handles various heterogeneous EHRs with a single unified framework, without requiring any prior knowledge of each distinct EHR.

  • Our method achieves comparable performance on single EHR dataset tasks, while consistently showing superior performance on pooled learning and transfer learning which require a model to comprehend various heterogeneous EHR systems. This implies that our framework serves as a guideline for building large-scale EHR models that can process any form of EHR systems from multiple sites.

  • To empirically demonstrate the efficacy of our work, we conducted extensive experiments with various datasets, model structures, and prediction tasks. We believe that our findings can provide helpful insights for further research on the multi-source learning of EHR.

2 Related work

Domain-knowledge-based predictive models.  Previous prediction tasks based on EHR utilize architectures such as recurrent neural networks [15, 2], convolutional neural networks [18], and transformer-based models [23, 22, 4]. Additionally, several studies on EHR-based prediction have attempted to fully utilize medical domain knowledge. MIMIC-Extract [25] performs domain-knowledge-based feature engineering, such as grouping semantically similar concepts into clinical taxonomy and standardizing measuring units. Based on these heavily hand-crafted features, McDermott et al. [17] proposed a benchmark for ten healthcare predictive tasks and reported their prediction performances. Similarly, Graph Convolutional Transformer (GCT) [4] utilizes domain knowledge by learning and employing the hidden graphical structure of EHR.

Overall, many previous works invested a considerable amount of time and effort into applying medical domain knowledge. Because they are specialized for their own respective settings, they work only for a specific dataset, not for general EHR systems, which are diverse and heterogeneous.

Resolving heterogeneous EHR systems.  Researchers have been working on alternatives to overcome heterogeneity in EHR, which is considered as one of the main challenges in the modeling of medical data. CDM is an approach that manually maps different EHR systems into a standardized format (e.g., FHIR [16]), which has been reported to facilitate satisfactory results in multiple prediction tasks [20]. However, the standardization of EHR formats requires extensive domain knowledge and intensive manual work, which could lead to human bias, making it impractical to integrate a large number of EHR systems.

AutoMap [26] conducts medical code mapping via self-supervised learning with a predefined medical ontology. This study aimed to develop a solution for the current lack of a unified EHR system through the direct code-to-code mapping of two different medical institutions. However, because medical ontology is essential for using AutoMap, it is still not suitable for modeling code systems that do not have a standardized ontology.

In another study, a description-based embedding method called DescEmb [11] was proposed to significantly reduce the need for aligning different code systems. Specifically, instead of dealing with medical codes directly, DescEmb exploits the text descriptions corresponding to each medical code, demonstrating its efficacy in multi-source learning, such as transfer and pooled learning. Moreover, because text descriptions inherit the semantic meanings of their corresponding code, DescEmb showed promising performance in multiple clinical prediction tasks. However, despite its merit, DescEmb still requires the selection of features specific to each EHR system, which requires system-specific expertise.

3 Methodology

3.1 Structure of Electronic Health Records

Here, we describe and summarize the EHR structure and notations used throughout this paper.

In typical EHR data, each patient $P$ can be represented as a sequence of medical events $[\mathcal{M}_1, \mathcal{M}_2, \ldots, \mathcal{M}_N]$, where $N$ is the total number of events throughout the entire patient visit history. The $i$-th medical event of a patient, $\mathcal{M}_i$, can be expressed as a set of event-associated features $\{a_i^1, a_i^2, \ldots\}$. Each feature $a_i^k$ can be seen as a tuple of a feature name and its value $(n_i^k, v_i^k)$ with $n_i^k \in \mathcal{N}$ and $v_i^k \in \mathcal{V}$, where $\mathcal{N}$ and $\mathcal{V}$ are each a set of unique feature names (e.g., “diagnosis code”, “drug name”, “drug dosage”, ...) and feature values (e.g., “401.9”, “vancomycin”, “10.0”, ...), respectively.

In addition, each medical event $\mathcal{M}_i$ has its corresponding event type $e_i$, which denotes the type of the event (e.g., “lab test”, “prescription”, “procedure”, ...). Lastly, since the recorded time $t_i$ is also provided with $\mathcal{M}_i$, we can measure the time interval between $\mathcal{M}_i$ and $\mathcal{M}_{i+1}$.
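As a minimal sketch of this structure, the notation above can be mirrored with two small Python data classes. The class and field names, the example values, and the use of hours as the time unit are illustrative assumptions, not part of the original framework.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MedicalEvent:
    event_type: str                  # event type, e.g. "lab test" or "prescription"
    time: float                      # recorded time (assumed here: hours since admission)
    features: List[Tuple[str, str]]  # (feature name, feature value) tuples


@dataclass
class Patient:
    events: List[MedicalEvent]       # medical events ordered by recorded time

    def time_intervals(self) -> List[float]:
        """Time gap between each event and the one that follows it."""
        times = [event.time for event in self.events]
        return [later - earlier for earlier, later in zip(times, times[1:])]


# Hypothetical patient with three events.
patient = Patient(events=[
    MedicalEvent("diagnosis", 0.0, [("diagnosis code", "401.9")]),
    MedicalEvent("prescription", 1.5, [("drug name", "vancomycin"), ("drug dosage", "10.0")]),
    MedicalEvent("lab test", 4.0, [("lab test name", "glucose"), ("value", "95.0")]),
])
intervals = patient.time_intervals()
```

Keeping feature values as strings reflects the fact that codes, drug names, and numeric results are all treated uniformly at this stage.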

Figure 2: Overview of UniHPF. On the top, a patient’s series of medical events occurs over time. Each medical event $\mathcal{M}_i$ is made up of event-related features $a_i^k$, including feature names and their values. These features, prepended with the event type $e_i$, are converted to corresponding descriptions and tokenized into a sequence of sub-words. Then an event encoder converts the sequence (i.e., event input) to an embedding $\mathbf{m}_i$, which is passed to the event aggregator, which then makes a prediction $\hat{y}$.

3.2 Universal Healthcare Predictive Framework

In this section, we present UniHPF, a universal framework for EHR-based prediction based on the following three principles, and describe how to implement each principle: (1) text-based embedding, (2) employing the entire features of EHR and (3) medical event aggregation. The overall architecture of UniHPF is depicted by Fig. 2.

Text-based embedding.  A conventional EHR embedding method starts by assigning a unique embedding to each element in the set of feature values $\mathcal{V}$ via a linear map $E$ (i.e., lookup table) [3, 23, 24, 17, 20], so that a feature value $v_i^k$ can be converted to a vector $\mathbf{v}_i^k$, typically followed by pooling multiple feature values ($\mathbf{v}_i^1, \mathbf{v}_i^2, \ldots$) to obtain $\mathbf{m}_i$, the embedding of the medical event $\mathcal{M}_i$ (previous EHR embedding methods do not typically use the feature name $n_i^k$).

This conventional embedding, however, usually requires a different lookup table $E$ for each medical institution due to code heterogeneity, namely each institution using a different set of feature values $\mathcal{V}$. For example, MIMIC-III [13], an open-source EHR dataset, uses ICD-9 diagnosis codes for recording diagnostic information, while eICU [19], another open-source EHR dataset, uses in-house diagnosis codes. Therefore, the conventional embedding is not the most suitable foundation on which to build a universal EHR framework.

DescEmb [11] proposed to resolve this issue with a text-based embedding, where hospital-specific feature values are first converted to textual descriptions (e.g., “401.9” → “unspecified essential hypertension”; generic feature values such as “vancomycin” and “10.0” are left as is), then a text encoder (e.g., BERT [8]) paired with a sub-word tokenizer (e.g., byte pair encoding [21]) is used to obtain the event embedding $\mathbf{m}_i$. With this approach, the model can learn the language of the underlying medical text rather than memorizing a unique embedding for each hospital-specific feature value, thereby overcoming the heterogeneity, as the same text encoder can be used for all institutions that use the same language.

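We extend the previous approach by applying the text-based embedding philosophy to event types $e_i$ and feature names $n_i^k$, in addition to feature values $v_i^k$, as follows:

$$\mathbf{m}_i = \mathrm{Enc}\big(\mathrm{Tok}(e_i \oplus n_i^1 \oplus v_i^1 \oplus n_i^2 \oplus v_i^2 \oplus \cdots) \oplus [\mathrm{TIME}]_i\big) \tag{1}$$

where $\oplus$ denotes concatenation, $\mathrm{Tok}$ is a sub-word tokenizer, $\mathrm{Enc}$ is an event encoder that takes a sequence of sub-word tokens and returns the event embedding $\mathbf{m}_i$, and $[\mathrm{TIME}]_i$ is a special token for time intervals. Note that $\mathrm{Enc}$ can be a pre-trained language model as in DescEmb, a randomly initialized Transformer encoder, or even a single-layer RNN.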
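The input construction described here can be sketched as follows. This is only an illustration under stated assumptions: the description table `DESCRIPTIONS` is hypothetical, the whitespace tokenizer `toy_tokenize` merely stands in for a real sub-word tokenizer, and the time-interval token and the encoder itself are omitted.

```python
from typing import List, Tuple

# Hypothetical code-to-description table; the single entry reuses the example from the text.
DESCRIPTIONS = {"401.9": "unspecified essential hypertension"}


def event_to_text(event_type: str, features: List[Tuple[str, str]]) -> str:
    """Prepend the event type, then append every feature name and feature value."""
    parts = [event_type]
    for name, value in features:
        parts.append(name)
        # Hospital-specific codes become descriptions; generic values stay as is.
        parts.append(DESCRIPTIONS.get(value, value))
    return " ".join(parts)


def toy_tokenize(text: str) -> List[str]:
    """Whitespace split, used only as a stand-in for a sub-word tokenizer."""
    return text.lower().split()


diagnosis_text = event_to_text("diagnosis", [("diagnosis code", "401.9")])
diagnosis_tokens = toy_tokenize(diagnosis_text)
drug_text = event_to_text("prescription", [("drug name", "vancomycin"), ("drug dosage", "10.0")])
```

The resulting token sequence is what an event encoder would consume to produce one embedding per medical event.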

Employing the entire features of EHR.  To develop a universal predictive framework, in addition to the code heterogeneity, we must consider the schema heterogeneity, namely each medical institution using a different database schema. When developing a conventional predictive model, engineers and medical domain experts are typically involved to define a subset of task-specific features among all available features according to each EHR system. This process must be carried out repeatedly whenever they encounter a different EHR schema. Moreover, in multi-source learning, engineers and medical domain experts must select and match compatible features among distinct EHR systems. For instance, in the Lab event of eICU, the feature named “labResult” should be paired with the “VALUENUM” feature in MIMIC-III’s LABEVENTS event. Assessing database schemas of multiple sources and matching compatible features, although inevitable in a conventional approach, is time-consuming and prone to human errors.

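To avoid this costly procedure, our framework exploits the entire features of medical events, effectively resolving the schema heterogeneity. As described in Eq. 1, the entire features in medical events are embedded into one unified embedding $\mathbf{m}_i$. With this approach, engineers no longer need to be concerned about feature selection since all features are used. Additionally, in multi-source learning, our framework is not constrained by the features that are present in each schema since both the name and the value of the feature are used. A formal comparison between the previous approaches and ours for obtaining the event embedding $\mathbf{m}_i$ is provided below to highlight the differences:

$$\text{Conventional:}\quad \mathbf{m}_i = \mathrm{pool}\big(E(v_i^1), E(v_i^2), \ldots\big) \quad \text{(selected features only)}$$

$$\text{DescEmb:}\quad \mathbf{m}_i = \mathrm{Enc}\big(\mathrm{Tok}(v_i^1 \oplus v_i^2 \oplus \cdots)\big) \quad \text{(selected features only)}$$

$$\text{UniHPF:}\quad \mathbf{m}_i = \mathrm{Enc}\big(\mathrm{Tok}(e_i \oplus n_i^1 \oplus v_i^1 \oplus n_i^2 \oplus v_i^2 \oplus \cdots)\big) \quad \text{(all features)}$$

where $E$ is a lookup-table embedding, $\mathrm{Tok}$ is a sub-word tokenizer, $\mathrm{Enc}$ is an event encoder, and pool is typically implemented as concatenation or summation of the elements. Note that we omitted the time interval in all equations to emphasize the fact that UniHPF differs from previous approaches in that it is the only approach to exploit all available information in a medical event: event type, all feature names, and all feature values. Therefore, UniHPF provides a general solution applicable to any EHR system with a different schema, making it schema-agnostic without requiring medical domain knowledge.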
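The difference between matching features by hand and using all features as text can be illustrated with a small sketch. The feature names `labResult` and `VALUENUM` and the event name `LABEVENTS` come from the example in the text; the remaining names (`labName`, `LABEL`, the event type `lab`) and all values are illustrative assumptions.

```python
# The same kind of lab measurement recorded under two different schemas.
eicu_event = ("lab", {"labName": "glucose", "labResult": "95.0"})
mimic_event = ("LABEVENTS", {"LABEL": "glucose", "VALUENUM": "95.0"})

# Conventional multi-source route: an expert-maintained feature mapping (hypothetical).
MANUAL_MAP = {"labName": "LABEL", "labResult": "VALUENUM"}


def align_to_mimic(features):
    """Rename eICU feature names to their hand-matched MIMIC-III counterparts."""
    return {MANUAL_MAP[name]: value for name, value in features.items()}


def serialize(event):
    """Schema-agnostic route: keep the event type and every feature name and value as text."""
    event_type, features = event
    parts = [event_type]
    for name, value in features.items():
        parts.extend([name, value])
    return " ".join(parts)


aligned = align_to_mimic(eicu_event[1])
eicu_text = serialize(eicu_event)
mimic_text = serialize(mimic_event)
```

The first route fails with a `KeyError` as soon as an unmapped feature appears, whereas the second route needs no per-schema table at all.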

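Medical event aggregation.  To take advantage of the characteristics of EHR, where the patient record $P$ consists of a sequence of medical events $\mathcal{M}_i$ and each $\mathcal{M}_i$ consists of a set of features $a_i^k$, we design a hierarchical model consisting of the event encoder $\mathrm{Enc}$ and the event aggregator $\mathrm{Agg}$.

After each $\mathcal{M}_i$ is converted to its embedding $\mathbf{m}_i$ according to Eq. 1, we can obtain $\mathbf{p}$, the vector representation of $P$, as follows:

$$\mathbf{p} = \mathrm{Agg}(\mathbf{m}_1, \mathbf{m}_2, \ldots, \mathbf{m}_N)$$

where $\mathrm{Agg}$ is an embedding function that takes a sequence of event embeddings and returns $\mathbf{p}$. Note that $\mathrm{Agg}$ can be implemented with any sequence encoder, such as a Transformer encoder or a single-layer RNN. Then, feeding $\mathbf{p}$ through a softmax layer (or a sigmoid layer for binary prediction) gives us the final prediction $\hat{y}$.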


Note that it is possible to obtain the patient representation $\mathbf{p}$ by employing a flattened model architecture rather than a hierarchical one as follows:

$$\mathbf{p} = \mathrm{Agg}\big(\mathrm{Tok}(e_1 \oplus n_1^1 \oplus v_1^1 \oplus \cdots) \oplus \cdots \oplus \mathrm{Tok}(e_N \oplus n_N^1 \oplus v_N^1 \oplus \cdots)\big)$$

where sub-word tokens from all features of all medical events are passed to the sequence model $\mathrm{Agg}$ at the same time. This alternative has the advantage of being able to directly associate features from different medical events (e.g., associate $v_i^k$ and $v_j^l$ of two different events via self-attention if $\mathrm{Agg}$ is a Transformer), as opposed to the hierarchical architecture where only an indirect connection is allowed. This, however, comes at the steep price of having to digest a significantly longer sequence of sub-word tokens, which increases not only the computational burden but also the hypothesis space to explore. We test both architectures in the experiments and demonstrate that the hierarchical approach, which reflects the characteristics of EHR data, indeed outperforms the flattened approach.
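To give a rough sense of the sequence-length cost, the sketch below counts the token pairs that one self-attention layer compares under each architecture, assuming both the encoder and the aggregator are Transformers. The counting model is a simplification introduced here (it ignores hidden size, layer counts, and padding), and the inputs of 98 events with 50 sub-words each are rounded from the MIMIC-III statistics in Table 1.

```python
def attention_pairs_hierarchical(num_events: int, tokens_per_event: int) -> int:
    """Pairs compared when each event is encoded separately, then events are aggregated."""
    encoder_pairs = num_events * tokens_per_event ** 2  # attention within each event
    aggregator_pairs = num_events ** 2                  # attention across event embeddings
    return encoder_pairs + aggregator_pairs


def attention_pairs_flattened(num_events: int, tokens_per_event: int) -> int:
    """Pairs compared when all sub-word tokens form one long sequence."""
    return (num_events * tokens_per_event) ** 2


hierarchical = attention_pairs_hierarchical(98, 50)
flattened = attention_pairs_flattened(98, 50)
ratio = flattened / hierarchical
```

Under these assumptions the flattened sequence requires roughly two orders of magnitude more pairwise comparisons per layer than the hierarchical one.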

4 Experiments

4.1 Experimental Settings

Datasets.  We draw on three publicly available datasets: MIMIC-III [13], MIMIC-IV [12], and eICU [19]. The MIMIC-III database consists of clinical data of over 40,000 patients admitted to intensive care units (ICU) at the Beth Israel Deaconess Medical Center from 2001 to 2012. MIMIC-IV is an updated version of MIMIC-III that includes new sources of data, admission date shifting, and an extended period of records collected from 2008 to 2019. eICU consists of ICU records from multiple US-based hospitals, totaling up to 140,000 unique patients admitted between 2014 and 2015.

All three datasets contain patient medical events including lab tests, prescriptions, and input events (e.g., drug injection), each event marked with timestamps. MIMIC-III and MIMIC-IV share the same code system with similar schemas, whereas eICU has a completely distinct code system and schema. For example, the prescription table in MIMIC-III employs the ICD code system to record patient drug intake and injection information. In contrast, eICU does not record drug information based on a structured code system, but rather represents it in a raw text format. Detailed data pre-processing is provided in our GitHub repository [9].

Data pre-processing and split.  For the sake of comparability, we built patient cohorts from the MIMIC-III, MIMIC-IV and eICU databases based on the following criteria: patients over the age of 18 years who remained in the ICU for over 24 hours. Then, we exclusively consider the first ICU stay during a single hospital stay, and remove any ICU stays with fewer than five medical events. Within each ICU stay, we restrict our samples to the first 12 hours of data, and remove features that occur fewer than five times in the entire dataset. To apply UniHPF to any EHR dataset, only two pre-processing steps are necessary, neither of which involves any domain knowledge. First, we drop features whose values only consist of integers. This automatically leads to using all continuous-valued features (e.g., lab test result) and textual features (e.g., lab test name), while features such as patient ID are removed. Second, we split numeric values digit-by-digit and assign a special token for each digit place, namely digit place embedding, which was first introduced in DescEmb [11]. More details about data pre-processing can be found at our GitHub repository [9].
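The two domain-knowledge-free steps can be sketched as below. This is a minimal illustration, not the repository's implementation: the column names and values are made up, values are assumed to be non-negative decimal strings, and the place-token format (`[INT1]`, `[DEC1]`, ...) is a hypothetical choice for showing how each digit can carry its place.

```python
from typing import Dict, List


def is_integer_only(values: List[str]) -> bool:
    """True if every value in a feature column parses as an integer (e.g., patient IDs)."""
    for value in values:
        try:
            int(value)
        except ValueError:
            return False
    return True


def drop_integer_features(table: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Step 1: keep only features that contain non-integer values."""
    return {name: values for name, values in table.items() if not is_integer_only(values)}


def digit_place_tokens(number: str) -> List[str]:
    """Step 2: split a number digit-by-digit and tag each digit with its place."""
    integer, _, fraction = number.partition(".")
    tokens: List[str] = []
    for index, digit in enumerate(integer):
        place = len(integer) - 1 - index  # 0 for ones, 1 for tens, ...
        tokens += [digit, f"[INT{place}]"]
    if fraction:
        tokens.append(".")
        for index, digit in enumerate(fraction, start=1):
            tokens += [digit, f"[DEC{index}]"]
    return tokens


table = {
    "SUBJECT_ID": ["101", "102"],       # integer-only, dropped
    "VALUENUM": ["95.0", "7.2"],        # continuous-valued, kept
    "LABEL": ["glucose", "sodium"],     # textual, kept
}
kept = drop_integer_features(table)
tokens = digit_place_tokens("10.5")
```

Note that the integer-only rule is purely syntactic, which is what lets it run on an unfamiliar schema without inspecting what each column means.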

To ensure reliable experiments and analysis, we split the dataset into training, validation and test sets according to 8:1:1 ratio in a stratified manner for each target label. All experiments were conducted with five random seeds and we report the mean performance. The dataset statistics are listed in Table 1.
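A stratified 8:1:1 split of this kind can be sketched with the standard library alone. The function name, the integer-parts interface, and the toy label distribution are assumptions for illustration; repeating the call with different `seed` values would correspond to running experiments under several random seeds.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(labels: Sequence[int], parts: Tuple[int, int, int] = (8, 1, 1),
                     seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Split sample indices into train/valid/test while preserving label proportions."""
    rng = random.Random(seed)
    by_label: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    total = sum(parts)
    train, valid, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = len(indices) * parts[0] // total
        n_valid = len(indices) * parts[1] // total
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # remainder goes to the test set
    return train, valid, test


# Toy binary task with a 20% positive rate.
labels = [1] * 20 + [0] * 80
train, valid, test = stratified_split(labels, seed=0)
```

Splitting within each label group keeps rare positive labels represented in all three subsets.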

| Statistic | MIMIC-III | eICU | MIMIC-IV |
|---|---|---|---|
| No. of Observations | 38040 | 98904 | 65511 |
| No. of ICU stays | 38040 | 98904 | 65511 |
| No. of Unique codes | 10385 | 6302 | 9565 |
| No. of Unique subwords | 2235 | 1585 | 2724 |
| Mean No. of events per sample | 98.47 | 538.89 | 116.32 |
| Mean of code length per event | 18.13 | 16.82 | 21.03 |
| Mean of subword length per event | 50.28 | 47.91 | 69.75 |

Table 1: Prediction datasets summary statistics

Baselines and implementation.  We use the following baseline models to evaluate the feasibility of UniHPF for our objective, namely schema-agnostic EHR embedding without medical domain knowledge. As there is no previous work, to our knowledge, that tackles exactly the same goal as ours, we modified well-known general-purpose EHR embedding frameworks. In addition, all models were provided with both the event type and the time interval for a fair comparison with UniHPF.


  • SAnD*: This uses the conventional embedding, selected features, and the flattened architecture, similar in spirit to SAnD [23]. Note that feature embeddings from all medical events are directly fed to the sequence encoder instead of being pooled to obtain individual event embeddings.

  • Rajkomar*: This uses the conventional embedding, the entire features, and the hierarchical approach, similar in spirit to [20] except for the CDM standardization. Note that feature embeddings from each medical event are fed to the event encoder to obtain individual event embeddings, which are then fed to the event aggregator.

  • DescEmb*: This uses the text-based embedding, selected features, and the hierarchical approach, similar in spirit to DescEmb [11].

For a fair comparison, the event encoder and the event aggregator were implemented with a randomly initialized 2-layer Transformer encoder and a 4-layer Transformer encoder, respectively, making all models equivalent in terms of the number of trainable parameters. Further implementation details, including the list of selected features (for example, from the prescription event, we chose essential features such as drug name, drug volume, and unit of measurement among all available features), are described in our GitHub repository [9].

Prediction tasks.  To evaluate our framework on a variety of healthcare predictive task types, we formulated a total of seven prediction tasks following McDermott et al. [17] based on individual ICU stays. All tasks are evaluated with the area under the precision-recall curve (AUPRC). Each task is defined as follows:

  1. Mortality Prediction (Mort) (binary): A sample is labeled positive for mortality if the discharge state was “expired” within the prediction window of 48 hours.

  2. Length-of-Stay Prediction (LOS) (binary): There are two cases for length of stay prediction: whether a given ICU stay lasted longer than 3 days (LOS3), and whether it lasted longer than 7 days (LOS7).

  3. Readmission Prediction (Readm) (binary): Given a single ICU stay, we consider a sample readmitted if it is followed by another ICU stay during the same hospital stay.

  4. Final Acuity (Fi_ac) (multi-class): Predicting where the patient will be discharged at the end of their stay, including the place of death.

  5. Imminent Discharge (Im_disch) (multi-class): Predicting whether the patient will be discharged within the prediction window of 48 hours and, if so, where they will be discharged.

  6. Diagnosis Prediction (Dx) (multi-label): Predicting all diagnosis (Dx) codes accumulated during the entire hospital stay. We group Dx codes into 18 Dx classes using Clinical Classification Software (CCS) for ICD-9-CM criteria [7].
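For reference, the evaluation metric used for the tasks above can be estimated for a binary task as in the following sketch. It computes average precision, one common estimator of AUPRC; the text does not state which estimator was used, so this choice, the function name, and the toy scores are assumptions. Tied scores are ordered arbitrarily here, and averaging over classes for the multi-class and multi-label tasks is not covered.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """Average of the precision values measured at the rank of each positive sample."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive samples")

    # Rank samples from the highest to the lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if y_true[index] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


# Toy example: positives ranked first and third out of four samples.
ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

Because the metric is driven by the ranking of positive samples, it remains informative for tasks with few positives, where a score near the positive rate corresponds to chance-level ranking.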

4.2 Experimental Design

To validate the utility of UniHPF from multiple aspects, we designed three sets of experiments: (1) single domain prediction, (2) pooled learning, and (3) transfer learning.

Single domain prediction.  In single domain prediction, all models are trained on a single dataset’s training set and tested on the same dataset’s test set (e.g., trained on eICU’s training set, tested on eICU’s test set). Based on these experiments, we intend to show that UniHPF can be utilized for single domain prediction even though it originally aims at multi-source learning. To provide more credibility, we additionally compare UniHPF with Benchmark [17]. Since Benchmark suggested an expert-designed feature-engineered prediction pipeline, comparing UniHPF with it can verify the effectiveness of our method, which does not involve any domain knowledge. Benchmark originally used lab test events and chart events, but in this work we use a modified Benchmark to use only the lab test events, and compare it with a modified UniHPF that also only uses lab test events. This is due to the explosive amount of chart events that occur within a 12-hour window (more than a thousand chart events), which we plan to handle in the future with computation-efficient sequence encoders such as efficient Transformers.

Pooled learning.  In order to utilize the abundance of EHR in prediction tasks, it is important to make use of data collected from multiple EHR systems. Such pooled learning enables the training of a model with diverse patient information, and finally can lead to more precise prediction. To show the capability of our framework in this scenario, we train the models on the pooled dataset from multiple sources, and evaluate them on each dataset’s test set. Note that imminent discharge and final acuity tasks are excluded because of incompatible label definitions between MIMIC and eICU.

Transfer learning.  In reality, it is more likely that a single deep learning model is trained on a large-scale hospital dataset and then transferred to individual institutions. Such transfer learning provides an opportunity for small hospitals to benefit from large-scale trained models. In this scenario, each model is first trained on a source dataset and then directly evaluated (i.e., zero-shot) or further trained (i.e., fine-tune) on a target dataset. Here, we introduce two extra baselines that can be used to automatically map different code systems between two EHR datasets: AutoMap and MUSE. AutoMap [26] is an automatic medical code mapping method, which aims to solve transfer learning by aligning two different code systems in a self-supervised manner. MUSE [6] is an unsupervised bilingual embedding mapping method, which was used as a baseline in [26]. For the same reason as in pooled learning, imminent discharge and final acuity tasks are excluded in transfer learning.

Figure 3: Comparison of single domain prediction performance. The data source used for training and evaluation is represented on each row. The y-axis describes prediction performance in terms of the area under the precision-recall curve (AUPRC), and the x-axis represents the models. The error bars represent standard error (SE), and specially marked models use only lab test events. Note that we do not report the MIMIC-IV result of Benchmark as it was not covered by [17].

4.3 Single Domain Prediction

The results of single domain prediction are shown in Fig. 3. First, we compare UniHPF with Benchmark [17] to see how the absence of domain knowledge affects prediction performance. UniHPF shows higher performance than Benchmark [17] in most prediction tasks. This implies that it is possible to achieve better AUPRC without significant feature engineering.

Next, we compare all models that use lab tests, prescriptions, and input events. UniHPF shows comparable prediction performance to models using domain knowledge and conventional embedding (SAnD*, Rajkomar*, DescEmb*), except for the readmission tasks on MIMIC-III and MIMIC-IV, for which all models fail to show decent AUPRC to begin with. In particular, a comparison between UniHPF and Rajkomar* suggests that it is unnecessary to assign unique embeddings for all feature names and values, and treating them as textual descriptions leads to comparable performance. In addition, a comparison between UniHPF and DescEmb* demonstrates that applying medical domain knowledge to select a subset of meaningful features does not necessarily lead to greater performance than simply using all features.

Overall, single domain prediction results show that UniHPF achieves comparable performance even without relying on medical domain knowledge, and by simply using all features as textual descriptions.

Figure 4: Pooled learning results. The data source used for evaluation is represented on each row. The y-axis describes the area under the precision-recall curve (AUPRC), and the x-axis represents the models. The blue dashed line separates models into conventional embedding models (left: SAnD*, Rajkomar*) and text-based embedding models (right: DescEmb*, UniHPF). Dot colors indicate the source datasets used for training. Note that “Single” refers to the same data source as the evaluation dataset (i.e., single domain prediction). To show the efficacy of pooled learning, we drew black arrows to indicate the performance gain or loss from “Single” to “MIMIC-III+MIMIC-IV+eICU”.

4.4 Pooled Learning

The results of pooled learning are shown in Fig. 4. For text-based embedding models (DescEmb* and UniHPF), the results when training on the pooled dataset from all three sources (red dots of C and D for each plot in Fig. 4) consistently show higher performance than the single domain predictions (yellow dots of C and D for each plot in Fig. 4). In contrast, in the case of conventional embedding models (SAnD* and Rajkomar*; the same dots of A and B for each plot in Fig. 4), the performance generally decreases. We speculate that this result comes from the fact that the MIMIC datasets and eICU have completely different code systems and schemas, so a model must learn the clinical semantics of different code systems to fully take advantage of pooled learning. That is, since the MIMIC datasets and eICU do not share any codes, training conventional embedding models on this pooled dataset does nothing but expand the number of required embeddings for each feature name and value, which prevents the model from taking advantage of larger training data (i.e., the pooled dataset). On the contrary, text-based embedding models can utilize the rich volume of data integrated from different sources because the sub-words of the medical descriptions will be shared even among completely distinct EHRs.
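The vocabulary argument can be made concrete with a toy sketch. The ICD-9 code “401.9” and its description come from the text; the second ICD-9 entry, the in-house codes `dx_17` and `dx_42`, and their descriptions are made-up placeholders, and whole words stand in for sub-words.

```python
# Hypothetical code-to-description tables for two EHR systems.
mimic_codes = {
    "401.9": "unspecified essential hypertension",
    "250.00": "diabetes mellitus without complication",
}
eicu_codes = {
    "dx_17": "hypertension",
    "dx_42": "diabetes mellitus",
}


def words(descriptions):
    """Set of words used across a collection of descriptions."""
    return {word for text in descriptions for word in text.split()}


# Conventional embedding: one table row per code, nothing shared across sources.
shared_codes = set(mimic_codes) & set(eicu_codes)
pooled_code_rows = len(set(mimic_codes) | set(eicu_codes))

# Text-based embedding: the descriptions overlap, so pooled data reuses the same tokens.
shared_words = words(mimic_codes.values()) & words(eicu_codes.values())
pooled_word_rows = len(words(mimic_codes.values()) | words(eicu_codes.values()))
```

In this toy case pooling only enlarges the code-level table, while every eICU description word is already covered by the MIMIC-side vocabulary.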

In addition, within text-based embedding models, UniHPF outperforms DescEmb* in most cases when all three data sources are pooled. Also, on the same pooled dataset, UniHPF consistently shows improved performances compared to the single domain prediction, whereas the performances of DescEmb* decrease in some cases. This result implies that UniHPF has a better capability to capture underlying semantics of distinct EHR sources than DescEmb*, by utilizing all available information in a medical event.

Columns: Zero-shot (SAnD*, AutoMap, MUSE, Rajkomar*, DescEmb*, UniHPF) | Fine-tune (SAnD*, AutoMap, MUSE, Rajkomar*, DescEmb*, UniHPF)

MIMIC-III → eICU
Mort | 0.048 0.044 0.039 0.035 0.135 0.135 | 0.141(-0.024) 0.105(+0.039) 0.138(+0.081) 0.137(-0.035) 0.162(0) 0.175(+0.006)
LOS3 | 0.452 0.465 0.463 0.458 0.503 0.507 | 0.575(-0.01) 0.541(+0.014) 0.573(+0.052) 0.587(+0.002) 0.578(-0.006) 0.589(+0.006)
LOS7 | 0.178 0.186 0.178 0.192 0.264 0.258 | 0.259(-0.027) 0.263(+0.052) 0.28(-0.001) 0.293(+0.008)
Readm | 0.166 0.165 0.154 0.169 0.320 0.333 | 0.399(-0.008) 0.398(+0.038) 0.351(-0.053) 0.28(-0.122) 0.415(+0.006)
Dx | 0.277 0.289 0.629 0.646 | 0.672(-0.007) 0.435(-0.157) 0.675(+0.07) 0.688(-0.006) 0.682(-0.005) 0.688(-0.001)

eICU → MIMIC-III
Mort | 0.048 0.048 0.048 0.238 0.245 | 0.254(-0.009) 0.066(-0.054) 0.32(-0.006) 0.299(+0.008) 0.333(+0.006)
LOS3 | 0.493 0.482 0.5 0.492 0.533 0.544 | 0.643(-0.019) 0.612(+0.012) 0.652(+0.049) 0.664(+0.001) 0.659(-0.006) 0.663(-0.003)
LOS7 | 0.219 0.221 0.292 0.308 | 0.333(-0.032) 0.333(+0.036) 0.35(-0.015) 0.367(+0.001) 0.379(+0.013)
Readm | 0.061 0.063 0.06 0.068 0.049 0.055 | 0.076(-0.01) 0.065(-0.012) 0.077(-0.004) 0.09(-0.003) 0.066(-0.002)
Dx | 0.533 0.536 0.639 0.647 | 0.758(+0.001) 0.648(-0.054) 0.751(+0.049) 0.76(0) 0.754(-0.006) 0.765(+0.006)

MIMIC-III → MIMIC-IV
Mort | 0.018 0.024 0.024 0.014 0.222 0.228 | 0.284(-0.003) 0.244(+0.076) 0.326(+0.009) 0.301(+0.009) 0.309(-0.006)
LOS3 | 0.402 0.411 0.404 0.389 0.527 0.536 | 0.592(-0.012) 0.582(+0.051) 0.614(-0.022) 0.607(+0.001) 0.654(+0.006)
LOS7 | 0.172 0.166 0.164 0.267 0.279 | 0.283(-0.034) 0.233(-0.004) 0.295(+0.047) 0.288(-0.043) 0.319(+0.006) 0.326(-0.002)
Readm | 0.08 0.082 0.082 0.085 0.082 0.095 | 0.109(-0.014) 0.093(-0.016) 0.114(+0.007) 0.131(+0.016) 0.118(+0.013) 0.121(+0.003)
Dx | 0.638 0.624 0.778 0.788 | 0.825(-0.007) 0.749(-0.032) 0.832(+0.051) 0.834(-0.002) 0.828(-0.001) 0.841(+0.007)

MIMIC-IV → MIMIC-III
Mort | 0.043 0.037 0.038 0.044 0.139 0.146 | 0.284(+0.021) 0.255(+0.143) 0.324(-0.002) 0.302(+0.011) 0.324(-0.003)
LOS3 | 0.494 0.495 0.508 0.490 0.572 0.586 | 0.651(-0.011) 0.646(+0.043) 0.656(-0.007) 0.656(-0.009) 0.66(-0.006)
LOS7 | 0.229 0.196 0.303 0.280 | 0.328(+0.031) 0.369(+0.004) 0.367(+0.001)
Readm | 0.049 0.057 0.059 0.062 0.062 0.070 | 0.063(-0.023) 0.072(-0.005) 0.079(-0.002) 0.095(+0.002) 0.079(+0.011) 0.062(+0.001)
Dx | 0.5 0.495 0.710 0.711 | 0.764(+0.007) 0.754(+0.052) 0.75(-0.01) 0.761(+0.002)

†: standard deviation


Table 2: AUPRCs of the zero-shot and fine-tune tests on five prediction tasks. For both settings, the best result in each row is written in boldface. For fine-tuning, the value in parentheses is the performance difference from the corresponding single-domain prediction.

4.5 Transfer Learning

The results of transfer learning are presented in Table 2. In the zero-shot setting, the code-based embedding methods (SAnD*, AutoMap, MUSE, and Rajkomar*) consistently perform worse than the text-based embedding methods (DescEmb* and UniHPF), except on the readmission task in MIMIC-III. This again indicates that text-based embedding is better suited than conventional code-based embedding for constructing a unified healthcare framework, because the textual descriptions of features share a sub-word vocabulary across sources. In addition, UniHPF generally exhibits the best performance across the transfer scenarios. This also suggests that using all available information (UniHPF) is more helpful for learning the semantics of medical code descriptions than selecting and matching specific features (DescEmb*), as mentioned in Sec. 4.4.
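To make the shared-vocabulary argument concrete, the following minimal sketch compares vocabulary overlap between two sources under a code-based and a text-based representation. The codes, descriptions, and whitespace tokenizer are hypothetical stand-ins chosen for illustration; they are not the datasets or the sub-word tokenizer used in the experiments.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections of tokens."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def text_vocab(source):
    """Collect lower-cased whitespace tokens from all descriptions.

    Whitespace splitting is an assumed stand-in for sub-word tokenization.
    """
    return {tok for desc in source.values() for tok in desc.lower().split()}


# Hypothetical (code -> description) tables for two EHR sources.
source_a = {"LAB_50912": "creatinine blood", "RX_0041": "vancomycin iv"}
source_b = {"lab:creat": "creatinine serum", "med:vanc": "vancomycin hcl iv"}

# Code-based view: each source has its own code system, so nothing is shared.
code_overlap = jaccard(source_a.keys(), source_b.keys())

# Text-based view: descriptions of the same concepts share tokens.
text_overlap = jaccard(text_vocab(source_a), text_vocab(source_b))

print(f"code overlap: {code_overlap:.2f}, text overlap: {text_overlap:.2f}")
```

In this toy setting the code vocabularies are disjoint, while the description tokens overlap, which is the property a zero-shot transfer between sources can exploit.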

In the fine-tune scenario, the results show the same trend as in the zero-shot test: UniHPF outperforms the other methods in most cases. In addition, compared with its single-domain prediction performance, UniHPF benefits from pre-training on the source dataset in most cases.
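The fine-tune entries of Table 2 pair an AUPRC with its difference from the single-domain result. The sketch below shows one way such an entry could be computed, assuming AUPRC is estimated by average precision and that scores contain no ties; the labels and scores are made up for illustration and are not taken from the experiments.

```python
def average_precision(y_true, y_score):
    """Average precision, a common step-wise estimate of AUPRC.

    Assumes binary labels (0/1) and no tied scores.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank examples from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / n_pos


# Hypothetical labels and predicted scores on a target test set.
labels = [1, 0, 1, 0]
single_domain_scores = [0.9, 0.8, 0.7, 0.1]  # trained on target only
fine_tuned_scores = [0.9, 0.2, 0.7, 0.1]     # pre-trained on source, then fine-tuned

single = average_precision(labels, single_domain_scores)
fine_tuned = average_precision(labels, fine_tuned_scores)

# Same layout as the fine-tune cells: value(difference from single domain).
report = f"{fine_tuned:.3f}({fine_tuned - single:+.3f})"
print(report)
```

A positive value in parentheses means that pre-training on the source dataset helped relative to training on the target dataset alone.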

4.6 Qualitative analysis for features in UniHPF

So far, UniHPF has demonstrated quantitatively superior, or at least comparable, predictive performance relative to all baselines across multiple prediction tasks, three EHR datasets, and three learning scenarios. In this section, we provide a qualitative case study to show that UniHPF not only reports good AUPRC numbers but also learns meaningful medical knowledge.

To see which features were significant in the predictive tasks, we accumulated the back-propagated gradient of each event at the event encoder. We followed the feature importance calculation method of DescEmb [11] on mortality prediction with MIMIC-III and eICU. We hypothesize that the larger the accumulated gradient, the more impactful the feature. The gradients for each event were tallied by the main feature of its corresponding event type (e.g., the lab test name in a lab test event, or the drug name in a prescription event). We analyzed the top 100 important features in MIMIC-III and eICU; the top 15 are provided in Table 3 in descending order. Within the top 100 features, we examined the features shared by DescEmb* and UniHPF to show that UniHPF still utilizes meaningful features even without a careful feature selection process. The two models share 87 and 79 of the top 100 features in MIMIC-III and eICU, respectively, which indicates that UniHPF can identify which features are significant for the predictive tasks without explicit guidance from human experts. Additional qualitative experiments are provided in our GitHub repository [9].
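The tallying and overlap steps described above can be sketched as follows. This is only an illustration under stated assumptions: the per-event gradient values and feature names are invented, summing absolute gradients is an assumed aggregation rule, and the helper names are hypothetical rather than taken from the released code.

```python
from collections import defaultdict


def accumulate_importance(events):
    """Sum gradient magnitudes per main feature.

    `events` is an iterable of (main_feature, gradient) pairs, one per event.
    Using the absolute value is an assumption of this sketch.
    """
    totals = defaultdict(float)
    for feature, grad in events:
        totals[feature] += abs(grad)
    return dict(totals)


def top_k(importance, k):
    """Return the k feature names with the largest accumulated importance."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [feature for feature, _ in ranked[:k]]


def shared_count(features_a, features_b):
    """Number of features that appear in both rankings."""
    return len(set(features_a) & set(features_b))


# Hypothetical per-event gradients for two models on the same task.
model_a_events = [
    ("norepinephrine", 0.5),
    ("vancomycin", 0.2),
    ("norepinephrine", 0.4),
    ("famotidine", 0.1),
]
model_b_events = [
    ("vancomycin", 0.6),
    ("norepinephrine", 0.3),
    ("heparin", 0.25),
]

top_a = top_k(accumulate_importance(model_a_events), k=2)
top_b = top_k(accumulate_importance(model_b_events), k=2)
print(top_a, top_b, shared_count(top_a, top_b))
```

With k set to 100 and real gradients, `shared_count` would correspond to the kind of overlap figure reported for DescEmb* and UniHPF.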

MIMIC-III eICU alendronate sodium po anf / ana oxycodone sustained release po d5w c bicarb morphine sulfate oral soln. po pantoprazole protonix furosemide lasix 500 / 100 vancomycin in ivpb acetaminophen - iv nss w / versed / fent vancomycin hcl rocuronium iv Norepinephrine cisatracurium alpha - fetoprotein oxycodone-acetaminophen 325mg pentamidine isethionate iv vitamin d oral muItivitamin -12 i v morphine 250 mg sodium chlorid heparin fIush port 10units/mI norepinephrine bitartrate heparin fIush 5000 units/mI rocuronium ivf infused ceftazidime famotidine pepcid iv push acetaminophen - iv 3 % sodium chloride ivf infused timolol maleate 0. 25 docusate sodium per ng tube ranitidine prophylaxis amiodarone bolus ivpb

Table 3: Top 15 important features in mortality prediction of UniHPF trained on the pooled dataset (MIMIC-III+eICU). We accumulated the gradients for each event at the event encoder, and ranked them in descending order.

5 Conclusion

In this paper, we proposed a universal healthcare prediction framework, UniHPF, which enables multi-source learning by solving EHR heterogeneity of code and schema simultaneously, without medical domain knowledge or pre-processing.

Our single-domain prediction results showed that UniHPF achieves comparable, if not superior, performance to all baselines without relying on domain expertise for data pre-processing or feature selection. We also demonstrated the robustness and efficiency of our framework through pooled learning experiments, without site-specific data harmonization. Finally, our results showed that the code-agnostic and schema-agnostic properties help improve transfer learning performance in both zero-shot and fine-tune settings. Owing to this efficacy, we believe UniHPF can act as a cornerstone for large-scale model training with multiple EHR sources.

Limitations.   Although it showed promising results, UniHPF is not without limitations. Due to computational limits, we used only a subset of EHR events (lab tests, prescriptions, and input events), even though UniHPF is able to utilize any EHR data. In other words, we could not process some EHR events such as “chart event”, since the event sequence grows extremely long if all existing event types are used, and passing such a long sequence directly into the model is computationally prohibitive. We expect performance gains if all available EHR event types could be exploited by adopting a modern memory-efficient architecture such as Performer [5] or S4 [10].

Future work.   From the perspective of large-scale EHR learning, self-supervised pre-training strategies could be applied to our model. In fact, we examined several well-known pre-training methods, including MLM [8], SpanMLM [14], and wav2vec 2.0 [1], but found none of them to be effective. We leave the development of advanced EHR-specific pre-training methods for UniHPF as future work.


  • [1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33, pp. 12449–12460. Cited by: §5.
  • [2] E. Choi, M. T. Bahadori, A. Schuetz, W. F. Stewart, and J. Sun (2016) Doctor ai: predicting clinical events via recurrent neural networks. In Machine learning for healthcare conference, pp. 301–318. Cited by: §2.
  • [3] E. Choi, M. T. Bahadori, E. Searles, C. Coffey, M. Thompson, J. Bost, J. Tejedor-Sojo, and J. Sun (2016) Multi-layer representation learning for medical concepts. pp. 1495–1504. Cited by: §3.2.
  • [4] E. Choi, Z. Xu, Y. Li, M. Dusenberry, G. Flores, E. Xue, and A. Dai (2020) Learning the graphical structure of electronic health records with graph convolutional transformer. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 606–613. Cited by: §2.
  • [5] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. (2020) Rethinking attention with performers. arXiv preprint arXiv:2009.14794. Cited by: §5.
  • [6] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou (2017) Word translation without parallel data. arXiv preprint arXiv:1710.04087. Cited by: §4.2.
  • [7] Healthcare Cost and Utilization Project (HCUP) (2016) HCUP clinical classifications software (ccs) for icd-9-cm. Agency for Healthcare Research and Quality, Rockville, MD. Cited by: item 6.
  • [8] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2, §5.
  • [9] GitHub (2022) Repository. https://github.com/hoon9405/UniHPF Cited by: §4.1, §4.1, §4.1, §4.6.
  • [10] A. Gu, K. Goel, and C. Ré (2022) Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations (ICLR), Cited by: §5.
  • [11] K. Hur, J. Lee, J. Oh, W. Price, Y. Kim, and E. Choi (2022) Unifying heterogeneous electronic health records systems via text-based code embedding. In Conference on Health, Inference, and Learning, pp. 183–203. Cited by: §1, §2, §3.2, 3rd item, §4.1, §4.6.
  • [12] A. Johnson, L. Bulgarelli, T. Pollard, L. A. Celi, R. Mark, and S. Horng (2021) MIMIC-iv-ed. Cited by: §1, §4.1.
  • [13] A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific data 3 (1), pp. 1–9. Cited by: §1, §3.2, §4.1.
  • [14] M. Joshi, D. Chen, Y. Liu, D. S. Weld, L. Zettlemoyer, and O. Levy (2020) Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics 8, pp. 64–77. Cited by: §5.
  • [15] Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzel (2015) Learning to diagnose with lstm recurrent neural networks. arXiv preprint arXiv:1511.03677. Cited by: §2.
  • [16] J. C. Mandel, D. A. Kreda, K. D. Mandl, I. S. Kohane, and R. B. Ramoni (2016) SMART on fhir: a standards-based, interoperable apps platform for electronic health records. Journal of the American Medical Informatics Association 23 (5), pp. 899–908. Cited by: §1, §2.
  • [17] M. McDermott, B. Nestor, E. Kim, W. Zhang, A. Goldenberg, P. Szolovits, and M. Ghassemi (2021) A comprehensive ehr timeseries pre-training benchmark. In Proceedings of the Conference on Health, Inference, and Learning, pp. 257–278. Cited by: §2, §3.2, Figure 3, §4.1, §4.2, §4.3.
  • [18] P. Nguyen, T. Tran, N. Wickramasinghe, and S. Venkatesh (2016) Deepr: a convolutional net for medical records. IEEE journal of biomedical and health informatics 21 (1), pp. 22–30. Cited by: §2.
  • [19] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark, and O. Badawi (2018) The eicu collaborative research database, a freely available multi-center database for critical care research. Scientific data 5 (1), pp. 1–13. Cited by: §1, §3.2, §4.1.
  • [20] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, et al. (2018) Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1 (1), pp. 1–10. Cited by: §1, §2, §3.2, 2nd item.
  • [21] R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1725. Cited by: §3.2.
  • [22] J. Shang, T. Ma, C. Xiao, and J. Sun (2019) Pre-training of graph augmented transformers for medication recommendation. arXiv preprint arXiv:1906.00346. Cited by: §2.
  • [23] H. Song, D. Rajan, J. J. Thiagarajan, and A. Spanias (2018) Attend and diagnose: clinical time series analysis using attention models. In Thirty-second AAAI conference on artificial intelligence, Cited by: §2, §3.2, 1st item.
  • [24] L. Song, C. W. Cheong, K. Yin, W. K. Cheung, B. C. Fung, and J. Poon (2019) Medical concept embedding with multiple ontological representations. 19, pp. 4613–4619. Cited by: §3.2.
  • [25] S. Wang, M. B. McDermott, G. Chauhan, M. Ghassemi, M. C. Hughes, and T. Naumann (2020) Mimic-extract: a data extraction, preprocessing, and representation pipeline for mimic-iii. In Proceedings of the ACM conference on health, inference, and learning, pp. 222–235. Cited by: §2.
  • [26] Z. Wu, C. Xiao, L. M. Glass, D. M. Liebovitz, and J. Sun (2022) AutoMap: automatic medical code mapping for clinical prediction model deployment. arXiv preprint arXiv:2203.02446. Cited by: §1, §2, §4.2.