Benchmarking machine learning models on eICU critical care dataset

Progress of machine learning in critical care has been difficult to track, in part due to the absence of public benchmarks. Other fields of research (such as vision and NLP) have already established various competitions and benchmarks, whereas only the recent availability of large clinical datasets has made public benchmarks possible in critical care. Taking advantage of this opportunity, we propose a public benchmark suite to address four areas of critical care, namely mortality prediction, estimation of length of stay, patient phenotyping and risk of decompensation. We define each task and compare the performance of clinical models, baseline models and deep models using the eICU critical care dataset of around 73,000 patients. Furthermore, we investigate the impact of numerical variables as well as the handling of categorical variables for each of the defined tasks.



1 Introduction

The increasing availability of clinical data and advances in machine learning have made it possible to address a wide range of healthcare problems, such as risk assessment and prediction in acute, chronic and critical care. Critical care is an especially data-intensive field, as continuous monitoring of patients in Intensive Care Units (ICUs) generates large streams of data that can be harnessed by machine learning algorithms. However, progress in harnessing digital health data faces several obstacles, including reproducibility and comparison of results between competing models. While other areas of machine learning research, such as image and natural language processing, have established a number of competitions and benchmarks (such as ILSVRC and N2C2, respectively), progress in machine learning for critical care has been difficult to measure, in part due to the absence of public benchmarks. However, the availability of large clinical datasets, including MIMIC III (johnson2016mimic) and more recently eICU (pollard2018eicu), is opening the possibility of establishing public benchmarks and consequently tracking the progress of machine learning models in critical care. In this paper, we propose a public benchmark suite to address four areas of critical care, namely mortality prediction, estimation of length of stay, patient phenotyping and risk of decompensation. We define each task and evaluate our algorithms on a dataset of 73,718 patients (containing 4,564,844 clinical records). While there has been initial work in this area focused on the MIMIC III clinical dataset (harutyunyan2019multitask), our work is the first to focus on a multi-center intensive care unit dataset, the eICU database (pollard2018eicu).

The main contributions of this work are the following: i) we provide the baseline performance and compare it with our benchmark result, using a model based on a bidirectional LSTM; ii) we investigate the impact of categorical and numerical variables on all four benchmarking tasks; iii) we evaluate entity embedding for categorical variables against one-hot encoding; iv) we show that for some tasks the number of variables can be reduced significantly without greatly impacting prediction performance; and v) we report six evaluation metrics for each of the tasks, facilitating direct comparison with future results. The source code for our experiments will be made public so that anyone with access to the eICU database can replicate our experiments and build upon our work.

2 eICU dataset description and cohort selection

Characteristic          Overall               Dead at Hospital      Alive at Hospital
Admissions              73718                 6167                  67551
Age                     62.41 [52-75]         68.12 [59-80]         61.8 [52-75]
Gender (F)              33544 (45.5)          2830 (45.8)           30714 (45.4)
Caucasian               56973 (77.2)          4866 (78.9)           52107 (77.1)
African American        7982 (10.8)           582 (9.4)             7400 (10.9)
Hispanic                2937 (3.98)           226 (3.6)             2711 (4)
Asian                   1174 (1.59)           97 (1.5)              1077 (1.5)
Native American         413 (0.56)            42 (0.68)             371 (0.54)
Unknown                 4239 (5.7)            354 (5.7)             3885 (5.7)
Hospital LoS* (days)    5.29 [2.53-6.84]      3.9 [1.42-5.22]       5.41 [2.65-6.92]
ICU LoS* (days)         2.32 [1.01-2.91]      3.17 [1.19-4.43]      2.24 [1-2.83]
Hospital Death          6167 (8.36)           6167 (100)            -
ICU Death               4575 (6.2)            4575 (74.1)           -
Table 1: Characteristics and mortality outcome measures. *LoS: Length of Stay. Continuous variables are presented as Median [Interquartile Range Q1–Q3]; binary or categorical variables as Count (%).

The eICU Collaborative Research Database (pollard2018eicu) is a multi-center intensive care unit database with high-granularity data for over 200,000 admissions to ICUs monitored by eICU programs across the United States. It comprises 200,859 patient unit encounters for 139,367 unique patients admitted between 2014 and 2015. We selected adult patients only (age > 18) who had an ICU admission with at least 15 records, leading to 73,718 unique patients with a median age of 62.41 years (Q1–Q3: 52-75), 45.5% of them female. The hospital mortality rate was 8.36%, and the median lengths of stay in hospital and in the ICU were 5.29 days and 2.32 days, respectively (further details are provided in Table 1). The final patient cohort contained 4,564,844 clinical records. We group these records into 1-hour windows, impute missing values with the mean of the corresponding window, and keep the last valid record of each window.
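As an illustration of the windowing step described above, the following minimal Python sketch shows one possible reading of it for a single variable of a single ICU stay. The function name `hourly_windows`, the `(offset_minutes, value)` record layout and the decision to skip windows with no observed value are assumptions made for illustration; they are not taken from the released pipeline.

```python
from statistics import mean


def hourly_windows(records):
    """Collapse (offset_minutes, value_or_None) records into one value per hour.

    Hypothetical helper: within each 1-hour window, missing values are
    replaced by the mean of the observed values in that window, and the
    last record of the window is kept.
    """
    buckets = {}
    # Sort by time offset only, so that None values are never compared.
    for offset_min, value in sorted(records, key=lambda r: r[0]):
        buckets.setdefault(offset_min // 60, []).append(value)

    result = {}
    for hour, values in buckets.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # Assumption: a window without any observation is left out here.
            continue
        window_mean = mean(observed)
        filled = [window_mean if v is None else v for v in values]
        result[hour] = filled[-1]  # last record of the window after imputation
    return result


# Heart-rate style example: three records in hour 0, one record in hour 1.
example = hourly_windows([(5, 80.0), (30, 90.0), (55, None), (70, 100.0)])
```

In this example the last record of hour 0 is missing, so it takes the window mean of the two observed values, while hour 1 keeps its single observed value.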

Out of the 31 tables in the eICU (v1.0) database, we selected variables from the following tables: patient (administrative information and patient demographics), lab (laboratory measurements collected during routine care), nurse charting (bedside documentation) and diagnosis. The selection was based on advice from a clinician as well as on consistency with other similar tasks reported in Section 4. The selected variables are shown in Table 2.

Variable Data Type
Heart rate Numerical
Mean arterial pressure Numerical
Diastolic blood pressure Numerical
Systolic blood pressure Numerical
O2 Numerical
Respiratory rate Numerical
Temperature Numerical
Glucose Numerical
FiO2 Numerical
pH Numerical
Height Numerical
Weight Numerical
Age Numerical
Admission diagnosis Categorical
Ethnicity Categorical
Gender Categorical
Glasgow Coma Score Total Categorical
Glasgow Coma Score Eyes Categorical
Glasgow Coma Score Motor Categorical
Glasgow Coma Score Verbal Categorical
Table 2: Selected variables for all the four tasks

3 Benchmarking experiments

3.1 Description of tasks

In this section, we define four different benchmark tasks, namely in-hospital mortality prediction, remaining length of stay forecasting, patient phenotyping, and risk of physiologic decompensation. After applying the selection criteria, the resulting patient cohorts are outlined in Table 3.

Task No. of patients Clinical records
In-hospital Mortality 30,680 1,164,966
RLoS 73,389 3,054,314
Phenotyping 49,768 2,192,497
Physiologic Decompensation 55,933 2,800,711
Table 3: Number of patients and records in four tasks

3.1.1 Mortality prediction

In-hospital mortality is defined as the patient's outcome at hospital discharge. This is a binary classification task, where each data sample spans a 1-hour window. The cohort for this task was selected based on the presence of a hospital discharge status in the patient's record and a length of stay of at least 48 hours (we focus on prediction during the first 24 and 48 hours). These selection criteria resulted in 30,680 patients with 1,164,966 records.

3.1.2 Length of stay prediction

Length of stay is one of the most important factors accounting for overall hospital costs; as such, its forecast could play an important role in healthcare management (kilicc2019cost). Length of stay is estimated through analysis of events occurring within a fixed time window, once every hour from the initial ICU admission. This is a regression task, where we use the 20 clinical variables described in Table 2. For this cohort we selected patients whose length of stay was present in their records with a duration of more than 24 hours. These selection criteria resulted in 73,389 ICU stays, containing 3,054,314 records. The mean length of stay was 1.86 days with a standard deviation of 1.94 days, as shown in Table


3.1.3 Phenotyping

Phenotyping refers to predicting the diseases (ICD-9 codes) recorded for a patient. Since any given patient may have more than one ICD-9 code, this is defined as a multi-label classification problem. The dataset contains 767 unique ICD codes, which are grouped into the 25 categories shown in Table 6. The cohort for this task, considering the initial inclusion criteria as well as the presence of a recorded diagnosis during the ICU stay, consists of 49,768 patients.
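To make the multi-label formulation concrete, the sketch below builds a binary target vector with one entry per phenotype group. Only three of the 25 groups are listed, and the ICD-9-to-group mapping `ICD9_TO_GROUP` is a small illustrative assumption rather than the grouping actually used for the benchmark; the function name `phenotype_target` is likewise hypothetical.

```python
# Three of the 25 phenotype groups from Table 6 (illustrative subset).
GROUPS = ["Septicemia", "Pneumonia", "Essential hypertension"]

# Assumed example mapping from ICD-9 codes to groups, for illustration only.
ICD9_TO_GROUP = {
    "038.9": "Septicemia",
    "486": "Pneumonia",
    "401.9": "Essential hypertension",
}


def phenotype_target(icd9_codes, groups=GROUPS, mapping=ICD9_TO_GROUP):
    """Return a multi-hot vector: 1 for every group the patient's codes fall into."""
    present = {mapping[code] for code in icd9_codes if code in mapping}
    return [1 if group in present else 0 for group in groups]


# A patient with both pneumonia and septicemia gets two positive labels.
target = phenotype_target(["486", "038.9"])
```

Because several entries of the vector can be 1 at the same time, each group is treated as its own binary classification problem.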

3.1.4 Physiologic Decompensation

There are a number of ways to define decompensation; however, in the clinical setting the majority of early warning systems, such as the National Early Warning Score (NEWS) (mcginley2012national), are based on the prediction of mortality within the next time window (such as 24 hours after the assessment). Following suit and keeping consistent with previously published benchmarks (harutyunyan2019multitask), we also define decompensation as a binary classification problem, where the target label indicates whether the patient dies within the next 24 hours. The cohort for this task consists of 55,933 patients (2,800,711 records), with a decompensation rate of around 6.5% (3,664 patients).
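The label definition above can be sketched as follows for a single ICU stay sampled once per hour. The function name `decompensation_labels`, the hour-based indexing and the exact boundary convention (death strictly after the current step and at most 24 hours later) are assumptions for illustration.

```python
def decompensation_labels(n_hours, death_hour=None, horizon=24):
    """Return one binary label per hourly time step t = 0 .. n_hours - 1.

    A label is 1 when death occurs within the next `horizon` hours after
    step t, and 0 otherwise. `death_hour` is None for patients who survive.
    """
    labels = []
    for t in range(n_hours):
        dies_soon = death_hour is not None and t < death_hour <= t + horizon
        labels.append(1 if dies_soon else 0)
    return labels


# A 30-hour stay ending in death: only the last 24 steps are positive.
labels_died = decompensation_labels(30, death_hour=30)
# A stay of a surviving patient: all steps are negative.
labels_survived = decompensation_labels(5)
```

Under this convention a long stay contributes many negative steps before any positive ones, which is why the task is evaluated per time step rather than per patient.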

3.2 Prediction algorithms

3.2.1 Baselines

We compare our model with two standard baseline approaches, namely logistic regression (LR) and a 1-layer artificial neural network (ANN). Both of these approaches use a flattened representation of all the features, concatenated in the order of the time steps. The embeddings for these models are learned in the same way as for the proposed BiLSTM model, as explained in Section 3.2.2.


3.2.2 Deep Learning models

In this section, we describe the selected clinical variables, the approaches used to represent them, and the baseline and deep models used in this study. Our architecture consists of three modules, namely an input module, an encoder module and an output module, as shown in Figure 1.

Input representation:

We process and model numerical and categorical variables separately; the type of each variable is listed in Table 2. Categorical variables are represented using either one-hot encoding (OHE) or entity embedding (EE). OHE is the baseline approach that converts the variables into a binary representation. Using this approach for our 7 categorical variables results in 429 unique values, rendering a large sparse matrix. In response, we represent each variable as an embedding and compare the performance with the OHE approach. We use entity embedding (guo2016entity), where each categorical variable in the dataset is mapped to a vector and the corresponding embedding is added to the patient's record. This entity embedding is learned by the neural network during the training phase along with the other parameters. The final representation of the input at time $t$ is as follows:

$$x_t = [\, n_t \,;\, E(c_t) \,]$$

where $n_t$ is the numerical variable, $c_t$ is the categorical variable at time $t$, $E$ is the embedding matrix learned by the model, and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation.
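The difference between the two representations of categorical variables can be illustrated with a small sketch. Looking up a row of an embedding table is equivalent to multiplying a one-hot vector by that table, but it yields a short dense vector instead of a long sparse one. The function names, the embedding sizes and the random initialisation below are assumptions for illustration; in the model the tables are learned during training.

```python
import random


def one_hot(index, size):
    """Sparse baseline representation: a single 1.0 at the category index."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def init_embedding(num_categories, dim, seed=0):
    """Randomly initialised embedding table (would be trained in practice)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_categories)]


def lookup(index, table):
    """Dense representation: the table row for this category."""
    return list(table[index])


def build_input(numerical, cat_indices, tables):
    """Concatenate numerical values with the embedding of each categorical value."""
    x = list(numerical)
    for index, table in zip(cat_indices, tables):
        x.extend(lookup(index, table))
    return x


# Assumed sizes: gender with 2 categories, ethnicity with 6 categories.
gender_table = init_embedding(num_categories=2, dim=2, seed=1)
ethnicity_table = init_embedding(num_categories=6, dim=3, seed=2)
x_t = build_input([0.0] * 13, [1, 4], [gender_table, ethnicity_table])
```

With these assumed sizes, the 13 numerical values are extended by 2 + 3 embedding dimensions, whereas one-hot encoding of the same two variables would add 2 + 6 binary columns.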


To capture sequential dependencies in our data, we use a Recurrent Neural Network (RNN), which resembles a chain of repeating modules that efficiently model sequential data (rumelhart1988learning). An RNN takes sequential data $x_1, \dots, x_T$ as input and provides a hidden representation $h_1, \dots, h_T$ which captures the information at every time step in the input. Formally,

$$h_t = f(x_t, h_{t-1}; W)$$

where $x_t$ is the input at time $t$, $W$ is the parameter of the RNN learned during training and $f$ is a non-linear operation such as sigmoid, tanh or ReLU.

A drawback of regular RNNs is that the input sequence is fed in one direction, normally from past to future. In order to capture both past and future context, we use a Bidirectional Long Short-Term Memory (BiLSTM) network (schuster1997bidirectional) (hochreiter1997long) for our model, which processes the input in both the forward and the backward direction. Using a BiLSTM, the model is able to capture the context of a record not only from its preceding records but also from the following records, allowing the model to produce more informed predictions. The input at time $t$ is represented by both its forward context $\overrightarrow{h_t}$ and its backward context $\overleftarrow{h_t}$ as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$. Similarly, the representation of the complete patient record is given by $h = [\overrightarrow{h_T}; \overleftarrow{h_1}]$, the final state of each direction.
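The bidirectional idea can be sketched without any deep learning library. For brevity, the sketch below uses a plain tanh recurrence with a scalar hidden state in place of an LSTM cell, so it illustrates only how the forward and backward passes are combined, not the gating of an LSTM. The function names and the parameter layout `(w_in, w_rec)` are assumptions for illustration.

```python
import math


def rnn_pass(xs, w_in, w_rec):
    """Simple tanh recurrence with a scalar hidden state; one state per step."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(sum(w * xi for w, xi in zip(w_in, x)) + w_rec * h)
        states.append(h)
    return states


def bidirectional(xs, fwd_params, bwd_params):
    """Run the recurrence in both directions and pair the states per time step."""
    fwd = rnn_pass(xs, *fwd_params)
    # Process the reversed sequence, then re-align to the original time order.
    bwd = rnn_pass(xs[::-1], *bwd_params)[::-1]
    per_step = list(zip(fwd, bwd))
    # Whole-record summary: the final state of each direction.
    summary = (fwd[-1], bwd[0])
    return per_step, summary


sequence = [[1.0], [0.5], [-1.0]]
params = ([0.8], 0.5)
per_step, summary = bidirectional(sequence, params, params)
```

Changing the last record of the sequence leaves the forward state of the first step untouched but alters its backward state, which is exactly the future context that a unidirectional model cannot see.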


The choice of output layer is based on whether the benchmarking task is a regression or a classification task.

Remaining length of stay (RLoS) prediction is a regression task, in which we predict the RLoS record-wise. That is, each patient record is fed to the model to predict the RLoS at that specific time step. This task is realized using a many-to-many architecture, where we assign a label to each patient record. The score for this task is obtained using:

$$\hat{y}_t = \mathrm{ReLU}(W h_t + b)$$

where $\hat{y}_t$ is the predicted RLoS at time step $t$, $h_t$ is the hidden representation at that step, $W$ and $b$ are model parameters, and ReLU is the non-linear activation function, used because the predicted RLoS cannot be negative.
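The record-wise targets for this task can be sketched as follows for a single stay sampled once per hour. The function names and the choice to express targets in days (to match the MAE unit reported later) are assumptions for illustration.

```python
def remaining_los_days(total_los_hours):
    """Return the remaining length of stay, in days, at each hourly step."""
    return [(total_los_hours - t) / 24.0 for t in range(int(total_los_hours))]


def relu(score):
    """Clamp a raw model score at zero so that a predicted RLoS is never negative."""
    return max(0.0, score)


# A 48-hour stay: 2 days remain at admission, 1 day remains after 24 hours.
targets = remaining_los_days(48)
```

Each hourly record therefore receives its own regression target, which is what makes the task many-to-many.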

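In-hospital mortality and decompensation are binary classification tasks. For in-hospital mortality, a many-to-one architecture is applied and the classifier is as follows:

$$\hat{y} = \sigma(W h + b)$$

where $h$ is the representation of the complete patient record and $\sigma$ is the sigmoid function.

For the decompensation task, a many-to-many architecture is applied. The prediction at each time step is treated as a binary classification and the classifier is defined as:

$$\hat{y}_t = \sigma(W h_t + b)$$

where $h_t$ is the hidden representation at time step $t$.

Phenotyping is defined as a multi-label task with 25 binary classifiers, one for each phenotype, and the score for the task is obtained using:

$$\hat{y}_t^{k} = \sigma(W^{k} h_t + b^{k})$$

where $t$ is the time step, $k$ is the phenotype being predicted, and $W^{k}$ and $b^{k}$ are the model parameters.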

Figure 1: Model architecture.

3.3 Results

In this section, we report benchmarking results of the methods and prediction algorithms, focusing on the following questions: (a) How does the performance of different models compare to the performance of clinical scoring systems? and (b) What is the impact on prediction performance of using different feature sets, namely both categorical and numerical variables, solely categorical variables, or solely numerical variables?

We evaluate our model on a random 80/20 train/test split using the following evaluation metrics: for the regression task we report the coefficient of determination ($R^2$) and the Mean Absolute Error (MAE), while for the classification tasks we report AUROC, AUPRC, Specificity, Sensitivity, Positive Predictive Value (PPV) and Negative Predictive Value (NPV).
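For reference, the threshold-based classification metrics and the two regression metrics listed above can be computed as in the following sketch. It follows the standard textbook definitions; the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here to keep the sketch total. AUROC and AUPRC are omitted because they require ranking predicted scores rather than thresholded labels.

```python
def _ratio(num, den):
    # Assumption: report 0.0 when the denominator is zero.
    return num / den if den else 0.0


def classification_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV and NPV from binary labels and predictions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
    }


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


metrics = classification_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 1])
```

A model that never predicts the positive class has a sensitivity of zero and an undefined PPV, which this sketch reports as 0.0.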

3.3.1 Mortality prediction

Results from this task indicate that the proposed approach of learning embeddings for categorical variables is more effective than the OHE representation. This holds true for both baseline models (LR and ANN) as well as for the BiLSTM model, as reflected in the prediction performance of each model. Furthermore, the BiLSTM-based model achieves the highest AUROC of all approaches in both the 24-hour and the 48-hour window (83.6 and 86, respectively), as shown in Table 4. It is interesting to note that using only categorical variables with embedding (reducing the number of variables from 20 to only 7) provides better performance than using numerical variables only (AUROC 79.3 vs. 74 and 80.1 vs. 76.7 for the first 24h and 48h, respectively). This also holds true for categorical variables with one-hot encoding. These results suggest that entity embedding of categorical features in vector space is more effective in the prediction of mortality.

Model   Num.  Cat.  Repn.      AUROC  AUPRC  Spec.  Sens.  PPV  NPV

First 24 hours
APACHE  -     -     Not spec.  66.1   53     97     35     63   91
LR      Yes   Yes   EMB        80.8   46.7   96     32     56   91
ANN     Yes   Yes   EMB        82.9   50.9   97     34     62   91
BiLSTM  Yes   No    -          74     35.6   80     54     29   92
BiLSTM  No    Yes   OHE        77.7   42.6   78     61     29   93
BiLSTM  No    Yes   EMB        79.3   44.5   90     45     39   92
BiLSTM  Yes   Yes   OHE        75.4   36.6   62     75     22   94
BiLSTM  Yes   Yes   EMB        83.6   50.1   90     55     44   93

First 48 hours
LR      Yes   Yes   EMB        85.4   55.07  97     36     65   91
ANN     Yes   Yes   EMB        84.8   54.25  95     47     60   93
BiLSTM  Yes   No    -          76.7   39.80  97     25     52   90
BiLSTM  No    Yes   OHE        78.9   46.8   85     56     35   93
BiLSTM  No    Yes   EMB        80.1   47.32  79     65     31   94
BiLSTM  Yes   Yes   OHE        71.3   34.12  61     67     20   93
BiLSTM  Yes   Yes   EMB        86     54.45  84     71     40   95
Table 4: In-hospital mortality prediction during the first 24 and 48 hours in ICU. (Num. and Cat. indicate the presence (Yes) or absence (No) of numerical and categorical variables, respectively. Repn. indicates the representation of categorical variables, either One Hot Encoding (OHE) or embedding (EMB).)

3.3.2 Length of stay prediction

Predicting Length of Stay (LoS) requires capturing temporal dependencies between time steps. For this reason, the baseline models perform poorly, due to their lack of explicit modelling of temporal dependencies. The proposed BiLSTM model, on the other hand, is able to capture this dependency effectively and outperforms the baseline models, as shown in Table 5. We can also see that the numerical variables are the most effective in the prediction of LoS. Using categorical variables encoded with OHE reduces the model performance, while entity embedding slightly improves the MAE (0.46 vs. 0.47 days).

Model   Num.  Cat.  Repn.  R2     MAE [Day]

In ICU unit
LR      Yes   Yes   EMB    0.03   1.25
ANN     Yes   Yes   EMB    0.032  1.253
BiLSTM  Yes   No    -      0.79   0.47
BiLSTM  No    Yes   EMB    0.75   0.50
BiLSTM  Yes   Yes   OHE    0.70   0.52
BiLSTM  Yes   Yes   EMB    0.74   0.46
Table 5: Length of stay in hospital prediction, evaluated using the coefficient of determination (R2) and Mean Absolute Error (MAE)

3.3.3 Phenotyping prediction

For the phenotyping task, we focus on comparing the performance (AUROC) of the proposed model on different subsets of features, namely numerical versus categorical variables. Using only the categorical features, modelled as entity embeddings, yields a markedly higher performance (0.84) than using only the numerical features (0.56), as outlined in Table 6. Categorical features are clearly more effective in representing a patient's phenotype, since integrating both subsets does not substantially improve the result (0.86 compared with 0.84). In this task there is a wide difference in the performance of the model across individual diseases, varying from 0.61 (diabetes mellitus without complications) to 0.96 (acute cerebrovascular disease). As a general trend, prediction performance on acute diseases is higher (0.83) than on chronic diseases (0.72). This may be due to the slow-progressing nature of chronic diseases: the recorded ICU data is relatively short and thus unable to fully capture events related to chronic diseases.

Phenotype Prevalence Type Num & cat Num. Cat.
Respiratory failure; insufficiency; arrest 0.241 acute 0.86 0.61 0.85
Fluid and electrolyte disorders 0.155 acute 0.70 0.57 0.70
Septicemia 0.144 acute 0.90 0.65 0.89
Acute and unspecified renal failure 0.141 acute 0.75 0.55 0.73
Pneumonia 0.120 acute 0.87 0.62 0.87
Acute cerebrovascular disease 0.108 acute 0.96 0.66 0.95
Acute myocardial infarction 0.089 acute 0.92 0.58 0.91
Gastrointestinal hemorrhage 0.079 acute 0.92 0.53 0.91
Shock 0.068 acute 0.85 0.62 0.82
Pleurisy; pneumothorax; pulmonary collapse 0.039 acute 0.62 0.50 0.63
Other lower respiratory disease 0.029 acute 0.85 0.51 0.85
Complications of surgical 0.011 acute 0.71 0.54 0.73
Other upper respiratory disease 0.007 acute 0.91 0.55 0.92
Macro-average (acute diseases) - - 0.83 - -
Hypertension with complications 0.018 chronic 0.85 0.47 0.83
Essential hypertension 0.203 chronic 0.68 0.62 0.64
Chronic kidney disease 0.103 chronic 0.67 0.49 0.62
Chronic obstructive pulmonary disease 0.093 chronic 0.77 0.54 0.77
Disorders of lipid metabolism 0.054 chronic 0.72 0.54 0.71
Coronary atherosclerosis and related 0.041 chronic 0.79 0.56 0.79
Diabetes mellitus without complication 0.006 chronic 0.61 0.52 0.51
Macro-average (chronic diseases) - - 0.72 - -
Cardiac dysrhythmias 0.165 mixed 0.72 0.60 0.69
Congestive heart failure; non hypertensive 0.105 mixed 0.79 0.62 0.78
Diabetes mellitus with complications 0.046 mixed 0.89 0.75 0.85
Other liver diseases 0.038 mixed 0.80 0.55 0.78
Conduction disorders 0.012 mixed 0.81 0.58 0.82
Macro-average (mixed diseases) - - 0.80 - -
Micro-average (all diseases) - - 0.86 0.56 0.84
Macro-average (all diseases) - - 0.80 0.57 0.78
Table 6: Phenotyping task on eICU (reported scores are AUROC)
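As a consistency check on the macro-averages in Table 6, the sketch below recomputes them from the rounded per-phenotype AUROC values in the "Num & cat" column. Because the inputs are the rounded table entries, agreement with the reported averages can only be expected to within rounding. The variable and function names are chosen here for illustration.

```python
# AUROC values copied from the "Num & cat" column of Table 6.
ACUTE = [0.86, 0.70, 0.90, 0.75, 0.87, 0.96, 0.92, 0.92, 0.85, 0.62, 0.85, 0.71, 0.91]
CHRONIC = [0.85, 0.68, 0.67, 0.77, 0.72, 0.79, 0.61]
MIXED = [0.72, 0.79, 0.89, 0.80, 0.81]


def macro_average(scores):
    """Unweighted mean over phenotypes: every phenotype counts equally."""
    return sum(scores) / len(scores)


acute_avg = macro_average(ACUTE)
chronic_avg = macro_average(CHRONIC)
mixed_avg = macro_average(MIXED)
overall_avg = macro_average(ACUTE + CHRONIC + MIXED)
```

The macro-average ignores prevalence, so a rare phenotype such as diabetes mellitus without complication weighs as much as a common one such as respiratory failure; this is one reason the macro-average over all diseases (0.80) is lower than the micro-average (0.86).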

3.3.4 Decompensation prediction

As mentioned in Section 3.1.4, decompensation is related to mortality prediction, with the difference that we predict whether the patient dies within the next 24 hours, given the current time step. As such, time dependence is critical. Since 3 of the 7 categorical variables are time-independent and only 4 are time-dependent, predicting decompensation from categorical variables alone is difficult for the model. For this reason, the model with only numerical variables outperforms the other approaches, as shown in Table 7.

Model   Num.  Cat.  Repn.  AUROC  AUPRC  Spec.  Sens.  PPV  NPV

In ICU unit
LR      Yes   Yes   EMB    57     64.9   6      99     62   84
ANN     Yes   Yes   EMB    66.4   71.8   22     94     65   71
BiLSTM  No    Yes   EMB    49.3   38     100    0      0    65
BiLSTM  Yes   Yes   EMB    97.4   94.4   92     91     86   95
BiLSTM  Yes   No    -      98     95.6   94     90     88   95
Table 7: Decompensation risk prediction in eICU

4 Related work

A number of scoring systems have been developed for mortality prediction, including the Acute Physiology and Chronic Health Evaluation (APACHE III (knaus1991apache), APACHE IV (zimmerman2006acute)) and the Simplified Acute Physiology Score (SAPS II, SAPS III) (le1993new). Most of these scoring systems use logistic regression to identify predictive features.

Providing an accurate prediction of mortality risk for patients admitted to the ICU using the first 24/48 hours of ICU data could make decision making easier and reduce healthcare costs. In this regard, recent advanced techniques in artificial intelligence have been shown to outperform conventional machine learning and clinical prediction techniques such as APACHE and SAPS (harutyunyan2019multitask) (Purushotham2018a) (lipton2015learning). Mortality prediction has been a popular application for deep learning researchers in recent years, though model architecture and problem definition vary widely. Convolutional neural networks and gradient boosted tree algorithms have been used by Darabi et al. (DARABI2018306) to predict long-term mortality risk (30 days) on a subset of the MIMIC-III dataset. Similarly, Celi et al. (celi2012database) developed mortality prediction models based on a subset of the MIMIC database using logistic regression, Bayesian networks and artificial neural networks.

Harutyunyan et al. (harutyunyan2019multitask) developed a deep learning model based on an LSTM RNN, called multi-task RNN, to predict in-hospital mortality, decompensation, phenotyping and length of stay in the ICU. The proposed model was applied to the MIMIC-III dataset. Similarly, Purushotham et al. (Purushotham2018a) carried out a comprehensive benchmark of several machine learning and deep learning models trained on MIMIC-III for various tasks, with results showing that deep models consistently outperform conventional machine learning models and the scoring systems.

Previous work (Purushotham2018a) (harutyunyan2019multitask) has shown that deep learning models obtain good results on forecasting length of stay in the ICU. In this regard, Tu et al. (TU1993220) applied neural network based methods to a private Canadian dataset of patients who underwent cardiac surgery. The proposed model was able to identify patients at low, intermediate and high risk of a prolonged stay in the ICU. Deep learning methods have been applied to phenotype prediction by Razavian et al. (razavian2016multi) and Lipton et al. (lipton2015learning). While the former trained LSTM and CNN models to predict 133 diseases based on 18 laboratory tests on a private dataset of 298k patients, the latter applied an LSTM RNN to a private pediatric intensive care unit (PICU) dataset in order to classify 128 diagnoses given 13 clinical measurements.

In terms of physiologic decompensation, Xu et al. (xu2018raim) proposed an attention-based model that outperformed several machine learning models in predicting decompensation events. They evaluated their model on the MIMIC-III Waveform Database Matched Subset. Similarly, Harutyunyan et al. (harutyunyan2019multitask) proposed decompensation prediction as one of the tasks in their multi-task benchmark.

5 Conclusion

In this study we have described four standardised benchmarks in critical care research. Our definition of the benchmark tasks is consistent with previously published benchmarks; however, we focus on the more recent eICU database, where clinical data has been collected from multiple ICU centres across the United States, which may result in lower systematic bias. We provided a set of baselines for our benchmarks and showed that a bidirectional LSTM significantly outperforms linear models, especially in tasks with temporal dependencies, such as length of stay. Of note is the impact of entity embedding of categorical variables in further improving the performance of our LSTM-based model. We believe that our work will provide a solid basis for further improving critical care decision making, and we provide the source code for other researchers who wish to replicate our experiments and build upon our results.


We gratefully acknowledge clinical input provided by Monica Moz, MD in both cohort selection as well as variable ranking and selection.