The increasing availability of clinical data and advances in machine learning have made it possible to address a wide range of healthcare problems, such as risk assessment and prediction in acute, chronic and critical care. Critical care is an especially data-intensive field, as continuous monitoring of patients in Intensive Care Units (ICUs) generates large streams of data that can be harnessed by machine learning algorithms. However, progress in harnessing digital health data faces several obstacles, including the reproducibility of results and the comparison of competing models. While other areas of machine learning research, such as image and natural language processing, have established a number of competitions and benchmarks (such as ILSVRC and N2C2, respectively), progress in machine learning for critical care has been difficult to measure, in part due to the absence of public benchmarks. The availability of large clinical data sets, including MIMIC III (johnson2016mimic) and more recently eICU (pollard2018eicu), is opening the possibility of establishing public benchmarks and consequently tracking the progress of machine learning models in critical care. In this paper, we propose a public benchmark suite that addresses four areas of critical care, namely mortality prediction, estimation of length of stay, patient phenotyping and risk of decompensation. We define each task and evaluate our algorithms on a dataset of 73,718 patients (containing 4,564,844 clinical records). While there has been initial work in this area focused on the MIMIC III clinical dataset (harutyunyan2019multitask), our work is the first to focus on a multi-center intensive care unit dataset, the eICU database (pollard2018eicu).
The main contributions of this work are as follows: i) we provide baseline performance and compare it with our benchmark result, obtained with a model based on a bidirectional LSTM; ii) we investigate the impact of categorical and numerical variables on all four benchmarking tasks; iii) we evaluate entity embedding for categorical variables against one-hot encoding; iv) we show that for some tasks the number of variables can be reduced significantly without greatly impacting prediction performance; and v) we report six evaluation metrics for each of the tasks, facilitating direct comparison with future results. The source code for our experiments will be made public at https://github.com/eICU-benchmark so that anyone with access to the eICU database can replicate our experiments and build upon our work.
2 eICU dataset description and cohort selection
| ||Overall||Dead at Hospital||Alive at Hospital|
|Age||62.41 [52-75]||68.12 [59-80]||61.8 [52-75]|
|Gender (F)||33544 (45.5)||2830 (45.8)||30714 (45.4)|
|Caucasian||56973 (77.2)||4866 (78.9)||52107 (77.1)|
|African American||7982 (10.8)||582 (9.4)||7400 (10.9)|
|Hispanic||2937 (3.98)||226 (3.6)||2711 (4)|
|Asian||1174 (1.59)||97 (1.5)||1077 (1.5)|
|Native American||413 (0.56)||42 (0.68)||371 (0.54)|
|Unknown||4239 (5.7)||354 (5.7)||3885 (5.7)|
|Hospital LoS* (days)||5.29 [2.53-6.84]||3.9 [1.42-5.22]||5.41 [2.65-6.92]|
|ICU LoS* (days)||2.32 [1.01-2.91]||3.17 [1.19-4.43]||2.24 [1-2.83]|
|Hospital Death||6167 (8.36)||6167 (100)||-|
|ICU Death||4575 (6.2)||4575 (74.1)||-|
The eICU Collaborative Research Database (pollard2018eicu) is a multi-center intensive care unit database with high-granularity data for over 200,000 admissions to ICUs monitored by eICU programs across the United States. The eICU database comprises 200,859 patient unit encounters for 139,367 unique patients admitted between 2014 and 2015 to hospitals located throughout the US. We selected only adult patients (age > 18) who had an ICU admission with at least 15 records, leading to 73,718 unique patients with a median age of 62.41 years (Q1–Q3: 52-75), 45.5% of whom were female. The hospital mortality rate was 8.3%, and the average length of stay in hospital and in unit were 5.29 days and 3.9 days respectively (further details are provided in Table 1). The final patient cohort contained 4,564,844 clinical records. We group these records into 1-hour windows, impute missing values with the mean of the window, and take the last valid record of each window.
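The windowing step can be illustrated with a minimal sketch. It rests on one reading of the description above: records are bucketed by patient and hour, a missing value is filled with the mean of the values observed in the same window, and the last record of the window is kept. The field names `patient_id` and `offset_min`, and the function `to_hourly`, are hypothetical and are not taken from the released code.

```python
from collections import defaultdict

def to_hourly(records, variables):
    """Collapse irregular records into one row per (patient, hour) window.

    `records` is a list of dicts with the hypothetical keys `patient_id`,
    `offset_min` (minutes since ICU admission) and one key per variable.
    A missing measurement is stored as None.
    """
    windows = defaultdict(list)
    ordered = sorted(records, key=lambda r: (r["patient_id"], r["offset_min"]))
    for rec in ordered:
        windows[(rec["patient_id"], rec["offset_min"] // 60)].append(rec)

    hourly = []
    for (patient_id, hour), recs in windows.items():
        row = {"patient_id": patient_id, "hour": hour}
        for var in variables:
            observed = [r[var] for r in recs if r[var] is not None]
            window_mean = sum(observed) / len(observed) if observed else None
            # Impute a gap with the window mean, then keep the last record.
            last = recs[-1][var]
            row[var] = last if last is not None else window_mean
        hourly.append(row)
    return hourly

# Toy data: three records in the first hour, one in the second.
records = [
    {"patient_id": 1, "offset_min": 5, "heart_rate": 80.0, "map": None},
    {"patient_id": 1, "offset_min": 30, "heart_rate": 90.0, "map": None},
    {"patient_id": 1, "offset_min": 50, "heart_rate": None, "map": None},
    {"patient_id": 1, "offset_min": 70, "heart_rate": 100.0, "map": 65.0},
]
hourly = to_hourly(records, ["heart_rate", "map"])
```

In this toy example the last heart-rate value of the first window is missing, so it is replaced by the window mean. A variable with no observation in a window stays missing and would need a separate imputation rule, which the text does not specify.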
Out of the 31 tables in the eICU (v1.0) database, we selected variables from the following tables: patient (administrative information and patient demographics), lab (laboratory measurements collected during routine care), nurse charting (bedside documentation) and diagnosis. The selection was based on advice from a clinician as well as consistency with other similar tasks reported in Section 4. Selected variables are shown in Table 2.
|Mean arterial pressure||Numerical|
|Diastolic blood pressure||Numerical|
|Systolic blood pressure||Numerical|
|Glasgow Coma Score Total||Categorical|
|Glasgow Coma Score Eyes||Categorical|
|Glasgow Coma Score Motor||Categorical|
|Glasgow Coma Score Verbal||Categorical|
3 Benchmarking experiments
3.1 Description of tasks
In this section, we define four benchmark tasks, namely in-hospital mortality prediction, remaining length of stay forecasting, patient phenotyping, and risk of physiologic decompensation. The patient cohorts that result from applying the selection criteria are outlined in Table 3.
|Task||No. of patients||Clinical records|
3.1.1 Mortality prediction
In-hospital mortality is defined as the patient’s outcome at hospital discharge. This is a binary classification task, where each data sample spans a 1-hour window. The cohort for this task was selected based on the presence of hospital discharge status in the patient’s record and a length of stay of at least 48 hours (we focus on prediction during the first 24 and 48 hours). These selection criteria resulted in 30,680 patients with 1,164,966 records.
3.1.2 Length of stay prediction
Length of stay is one of the most important factors accounting for overall hospital costs, so its forecast could play an important role in healthcare management (kilicc2019cost). Length of stay is estimated through analysis of events occurring within a fixed time window, once every hour from the initial ICU admission. This is a regression task, where we use the 20 clinical variables described in Table 2. For this cohort we selected patients whose length of stay was present in their records and lasted more than 24 hours. These selection criteria resulted in 73,389 ICU stays, containing 3,054,314 records. The mean length of stay was 1.86 days with a standard deviation of 1.94 days, as shown in Table 1.
3.1.3 Phenotyping
Phenotyping refers to the prediction of a patient’s diseases (ICD-9 codes). Since any given patient may have more than one ICD-9 code, this is defined as a multi-label classification problem. The dataset contains 767 unique ICD codes, which are grouped into 25 categories shown in Table 6. The cohort for this task, which combines the initial inclusion criteria with the requirement of a recorded diagnosis during the ICU stay, consists of 49,768 patients.
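The multi-label target can be sketched as a multi-hot vector with one position per phenotype category. The mapping below is a toy stand-in with three categories rather than the 25 used in this work, and the names `multi_hot` and `code_to_group` are hypothetical.

```python
def multi_hot(patient_codes, code_to_group, n_groups):
    """Turn a patient's ICD-9 codes into a 0/1 vector over phenotype groups."""
    labels = [0] * n_groups
    for code in patient_codes:
        group = code_to_group.get(code)
        if group is not None:  # codes outside the grouping are ignored
            labels[group] = 1
    return labels

# Toy grouping (assumption): 0 = acute myocardial infarction,
# 1 = respiratory failure, 2 = acute renal failure.
code_to_group = {"410.71": 0, "518.81": 1, "584.9": 2}

labels = multi_hot(["518.81", "584.9", "unmapped-code"], code_to_group, n_groups=3)
```

A patient with two mapped diagnoses therefore receives two positive labels, which is what distinguishes this task from single-label multi-class classification.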
3.1.4 Physiologic Decompensation
There are a number of ways to define decompensation. In the clinical setting, however, the majority of early warning systems, such as the National Early Warning Score (NEWS) (mcginley2012national), are based on prediction of mortality within the next time window (such as 24 hours after the assessment). Following suit and keeping consistent with previously published benchmarks (harutyunyan2019multitask), we also define decompensation as a binary classification problem, where the target label indicates whether the patient dies within the next 24 hours. The cohort for this task consists of 55,933 patients (2,800,711 records), with a decompensation rate of around 6.5% (3,664 patients).
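As a minimal sketch of this labelling scheme, the function below assigns one label to each hourly time step of a stay. It assumes that time is counted in whole hours since ICU admission and that the time of death, when there is one, is known on the same scale. The name `decompensation_labels` and its parameters are hypothetical.

```python
def decompensation_labels(n_hours, death_hour=None, horizon=24):
    """Return one 0/1 label per hourly time step of a stay.

    The label at hour t is 1 when the patient dies within the following
    `horizon` hours, and 0 otherwise. Survivors pass death_hour=None.
    """
    labels = []
    for t in range(n_hours):
        dies_soon = death_hour is not None and t < death_hour <= t + horizon
        labels.append(int(dies_soon))
    return labels

# A 30-hour stay ending in death at hour 30: only the last 24 steps are positive.
positive_stay = decompensation_labels(n_hours=30, death_hour=30)
# A survivor has no positive step.
negative_stay = decompensation_labels(n_hours=30)
```

Because every time step carries its own label, a single stay contributes many training samples, most of them negative even for patients who die.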
3.2 Prediction algorithms
3.2.1 Baselines
We compare our model with two standard baseline approaches, namely logistic regression (LR) and a 1-layer artificial neural network (ANN). Both approaches use a flattened representation in which the features of all time steps are concatenated in time-step order. The embeddings for these models are learned in the same way as for the proposed BiLSTM model, as explained in Section 3.2.2.
3.2.2 Deep Learning models
In this section, we describe the selected clinical variables, the approaches used to represent these variables, and the baseline and deep models used in this study. The architecture consists of three modules, namely an input module, an encoder module and an output module, as shown in Figure 1.
We process and model numerical and categorical variables separately, as shown in Table 2. Categorical variables are represented using either one-hot encoding (OHE) or entity embedding (EE). OHE is the baseline approach, which converts the variables into a binary representation. Using this approach for our 7 categorical variables results in 429 unique records, rendering a large sparse matrix. In response, we represent each variable as an embedding and compare the performance with the OHE approach. We use entity embedding (guo2016entity), where each categorical variable in the dataset is mapped to a vector and the corresponding embedding is added to the patient’s record. This entity embedding is learned by the neural network during the training phase along with the other parameters. The final representation of the input at time $t$ is therefore:
$$x_t = [\, n_t \,;\, E\, c_t \,]$$
where $n_t$ is the numerical variable, $c_t$ is the categorical variable at time $t$ and $E$ is the embedding matrix learned by the model.
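The difference between the two representations can be made concrete with a small NumPy sketch. It is an illustration under stated assumptions, not the released implementation: the embedding matrix `E` is random here, whereas in the model it is learned during training, and the sizes (5 category values, embedding width 3) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories = 5    # distinct values of one categorical variable (toy size)
embedding_dim = 3   # assumed embedding width
E = rng.normal(size=(n_categories, embedding_dim))  # learned in practice

def one_hot(index, size):
    """Sparse binary representation of a categorical value."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def build_input(numerical, category_index, embedding):
    """Concatenate numerical values with the embedded categorical value."""
    return np.concatenate([numerical, embedding[category_index]])

n_t = np.array([72.0, 118.0])  # two numerical variables at one time step
c_t = 3                        # categorical value, stored as an index

x_ohe = np.concatenate([n_t, one_hot(c_t, n_categories)])  # length 2 + 5
x_ee = build_input(n_t, c_t, E)                            # length 2 + 3
```

Looking up row `c_t` of `E` gives the same vector as multiplying the one-hot vector by `E`, so the embedding can be read as a dense, trainable replacement for the sparse one-hot columns.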
To capture sequential dependency in our data, we use a Recurrent Neural Network (RNN), which resembles a chain of repeating modules and models sequential data efficiently (rumelhart1988learning). An RNN takes sequential data $x_1, \dots, x_T$ as input and provides a hidden representation $h_t$ that captures the information at every time step in the input. Formally,
$$h_t = f(x_t, h_{t-1}; W)$$
where $x_t$ is the input at time $t$, $W$ is the parameter of the RNN learned during training and $f$ is a non-linear operation such as sigmoid, tanh or ReLU.
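The recurrence can be sketched in a few lines of NumPy. This is a plain tanh RNN with randomly initialised weights, chosen only to show how each hidden state depends on the current input and the previous state. The split of the parameters into `W_x`, `W_h` and `b` is an assumption made for the illustration.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Run a plain tanh RNN over xs (T x input_dim); return T x hidden_dim."""
    h = np.zeros(W_h.shape[0])  # initial hidden state
    states = []
    for x_t in xs:
        # New state from the current input and the previous state.
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 4, 3, 5
xs = rng.normal(size=(T, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

hs = rnn_forward(xs, W_x, W_h, b)  # one hidden vector per time step
```

The same weights are reused at every step, which is what lets the model handle stays of different lengths.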
A drawback of regular RNNs is that the input sequence is fed in one direction, normally from past to future. In order to capture both past and future context, we use a Bidirectional Long Short Term Memory (BiLSTM) (schuster1997bidirectional) (hochreiter1997long) for our model, which processes the input in both the forward and the backward direction. With a BiLSTM, the model captures the context of a record from both its preceding and its following records, allowing it to produce more informed predictions. The input at time $t$ is represented by both its forward context $\overrightarrow{h_t}$ and backward context $\overleftarrow{h_t}$ as $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$. Similarly, the representation of the completed patient record is given by $h_T = [\overrightarrow{h_T}; \overleftarrow{h_T}]$.
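The bidirectional idea can be shown with a simplified sketch. It uses a plain tanh recurrence in place of LSTM cells, which is a deliberate simplification, and random weights. Only the mechanism is illustrated: one pass reads the sequence forwards, a second pass reads it backwards, and the two states are concatenated at each time step.

```python
import numpy as np

def run_rnn(xs, W_x, W_h):
    """Plain tanh recurrence (a stand-in for an LSTM cell)."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h)
        out.append(h)
    return np.stack(out)

def bidirectional(xs, fwd_params, bwd_params):
    """Concatenate forward and backward states for every time step."""
    forward = run_rnn(xs, *fwd_params)
    # Read the sequence in reverse, then flip back to the original time order.
    backward = run_rnn(xs[::-1], *bwd_params)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
xs = rng.normal(size=(T, input_dim))
fwd = (rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim)))
bwd = (rng.normal(size=(hidden_dim, input_dim)), rng.normal(size=(hidden_dim, hidden_dim)))

states = bidirectional(xs, fwd, bwd)  # shape: T x (2 * hidden_dim)
```

The forward half of the state at a given step depends only on records up to that step, while the backward half depends on that step and all later ones. This is why a bidirectional encoder is applied to a window of records that has already been fully observed.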
The choice of output layer is based on whether the benchmarking task is a regression or a classification task.
Remaining length of stay (RLoS) prediction is a regression task, in which we predict the RLoS record-wise. That is, each patient record is fed to the model to predict the RLoS for that specific time step. This task is realized using a many-to-many architecture, where we assign a label to each patient record. The score for this task is obtained using:
$$\hat{y}_t = \mathrm{ReLU}(W h_t)$$
where $\hat{y}_t$ is the predicted RLoS, $h_t$ is the hidden representation at time step $t$, and $\mathrm{ReLU}$ is the non-linear activation function, used because the prediction of RLoS cannot be negative.
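A minimal sketch of the record-wise regression setup follows. Two things in it are assumptions rather than statements from the text: the target at each hourly step is taken to be the total stay minus the elapsed time, expressed in days, and the non-negative activation is taken to be ReLU. The names `rlos_targets` and `rlos_head` are hypothetical, and the hidden states are random placeholders for encoder outputs.

```python
import numpy as np

def rlos_targets(total_los_hours, n_steps):
    """Remaining length of stay, in days, at each hourly time step."""
    elapsed = np.arange(n_steps)
    return np.maximum(total_los_hours - elapsed, 0) / 24.0

def rlos_head(hidden_states, w, b):
    """Many-to-many regression head: one non-negative value per time step."""
    return np.maximum(hidden_states @ w + b, 0.0)  # ReLU clips negatives to 0

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(3, 4))  # placeholder encoder output, T x hidden
w = rng.normal(size=4)
b = -0.5

targets = rlos_targets(total_los_hours=48, n_steps=3)
predictions = rlos_head(hidden_states, w, b)
```

Each time step gets its own target and its own prediction, so a loss such as the mean absolute error can be averaged over all steps of all stays.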
In-hospital mortality and decompensation are binary classification tasks. For in-hospital mortality, a many-to-one architecture is applied and the classifier is as follows:
$$\hat{y} = \sigma(W h_T)$$
For the decompensation task, a many-to-many architecture is applied. The prediction at each time step is treated as a binary classification and the classifier is defined as:
$$\hat{y}_t = \sigma(W h_t)$$
Phenotyping is defined as a multi-label task with 25 binary classifiers, one for each phenotype, and the score for the task is obtained using:
$$\hat{y}_t^{k} = \sigma(W^{k} h_t)$$
where $\sigma$ is the sigmoid function, $h_t$ is the hidden representation at time step $t$ (with $h_T$ the representation of the completed record), $t$ is the time step, $k$ is the phenotype being predicted and $W$ is the model parameter.
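The three classification heads differ mainly in which hidden states they read and how many outputs they produce. The sketch below assumes sigmoid outputs and a bias term, uses random placeholders for the encoder states and weights, and applies the phenotyping head at every time step. These are illustrative choices, not a description of the released model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mortality_head(hidden_states, w, b):
    """Many-to-one: a single probability from the final hidden state."""
    return sigmoid(hidden_states[-1] @ w + b)

def decompensation_head(hidden_states, w, b):
    """Many-to-many: one probability per time step."""
    return sigmoid(hidden_states @ w + b)

def phenotyping_head(hidden_states, W, b):
    """Multi-label: independent sigmoid outputs, one column per phenotype."""
    return sigmoid(hidden_states @ W + b)

rng = np.random.default_rng(0)
T, hidden_dim, n_phenotypes = 6, 4, 25
hidden_states = rng.normal(size=(T, hidden_dim))  # placeholder encoder output

p_mortality = mortality_head(hidden_states, rng.normal(size=hidden_dim), 0.0)
p_decomp = decompensation_head(hidden_states, rng.normal(size=hidden_dim), 0.0)
p_pheno = phenotyping_head(
    hidden_states, rng.normal(size=(hidden_dim, n_phenotypes)), np.zeros(n_phenotypes)
)
```

Because each phenotype has its own sigmoid rather than sharing a softmax, the 25 probabilities for a patient do not have to sum to one, which matches the multi-label definition of the task.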
3.3 Results
In this section, we report benchmarking results for the methods and prediction algorithms, focusing on the following questions: (a) How does the performance of different models compare to that of clinical scoring systems? and (b) What is the impact on prediction performance of using different feature sets, namely categorical and numerical variables together, categorical variables only, and numerical variables only?
We evaluate our model on a random 80/20 train/test split using the following evaluation metrics: for the regression tasks we report the coefficient of determination ($R^2$) and Mean Absolute Error (MAE), while for the classification tasks we report AUROC, AUPRC, Specificity, Sensitivity, Positive Predictive Value (PPV) and Negative Predictive Value (NPV).
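The threshold-based classification metrics and the two regression metrics follow directly from their standard definitions, sketched below in plain Python. AUROC and AUPRC are omitted because they are computed from ranked scores rather than from thresholded labels. The function names are hypothetical and the toy labels are made up for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    # Assumes every denominator is non-zero (both classes present and predicted).
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

clf = classification_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
mae = mean_absolute_error([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

On imbalanced tasks such as mortality and decompensation, specificity and NPV tend to be high almost by default, so sensitivity, PPV and AUPRC carry more of the information about how well the rare positive class is recognised.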
3.3.1 Mortality prediction
Results from this task indicate that the proposed approach of learning embeddings for categorical variables is more effective than the OHE representation. This holds true for both baseline models (LR and ANN) as well as the BiLSTM model, and is reflected in the prediction performance of each model. Furthermore, the BiLSTM-based model outperforms all the other approaches in predicting mortality in both the 24-hour and the 48-hour window, as shown in Table 4. It is interesting to note that using only categorical variables with embedding (reducing the number of variables from 20 to only 7) provides better performance than using numerical variables only (AUROC 79.3 vs. 74 and 80.1 vs. 76.7 for the first 24h and 48h, respectively). This also holds true for categorical variables with one-hot encoding. These results suggest that entity embedding of categorical features in vector space is more effective in the prediction of mortality.
[Table 4: mortality prediction results for the first 24 hours and the first 48 hours]
3.3.2 Length of stay prediction
Predicting Length of Stay (LoS) requires capturing temporal dependencies between time steps. For this reason, the baseline models perform poorly, as they lack explicit modelling of temporal dependencies. The proposed BiLSTM model, on the other hand, is able to capture this dependency effectively and outperforms the baseline models, as shown in Table 5. We can also see that the numerical variables are the most effective in the prediction of LoS. Adding categorical variables encoded with OHE reduces model performance, while entity embedding improves the MAE, Kappa and AUROC measures.
[Table 5: length of stay prediction results in the ICU]
3.3.3 Phenotyping prediction
For the phenotyping task, we focus on comparing the performance (AUROC) of the proposed model on different subsets of features, namely numerical versus categorical variables. Using only the categorical features, modelled as entity embeddings, gives significantly higher performance (0.84) than using only the numerical features (0.56), as outlined in Table 6. Categorical features are clearly more effective in representing a patient’s phenotype, since integrating both subsets does not significantly improve the result (0.86 from 0.84). In this task there is a wide difference between the performance of the model on individual diseases, varying from 0.61 (diabetes mellitus without complications) to 0.96 (acute cerebrovascular disease). As a general trend, prediction performance on acute diseases is higher (0.83) than on chronic diseases (0.72). This may be due to the slow-progressing nature of chronic diseases: the recorded ICU data covers a relatively short period and is thus unable to fully capture events related to chronic diseases.
|Phenotype||Prevalence||Type||Num & cat||Num.||Cat.|
|Respiratory failure; insufficiency; arrest||0.241||acute||0.86||0.61||0.85|
|Fluid and electrolyte disorders||0.155||acute||0.70||0.57||0.70|
|Acute and unspecified renal failure||0.141||acute||0.75||0.55||0.73|
|Acute cerebrovascular disease||0.108||acute||0.96||0.66||0.95|
|Acute myocardial infarction||0.089||acute||0.92||0.58||0.91|
|Pleurisy; pneumothorax; pulmonary collapse||0.039||acute||0.62||0.50||0.63|
|Other lower respiratory disease||0.029||acute||0.85||0.51||0.85|
|Complications of surgical||0.011||acute||0.71||0.54||0.73|
|Other upper respiratory disease||0.007||acute||0.91||0.55||0.92|
|Macro-average (acute diseases)||-||-||0.83||-||-|
|Hypertension with complications||0.018||chronic||0.85||0.47||0.83|
|Chronic kidney disease||0.103||chronic||0.67||0.49||0.62|
|Chronic obstructive pulmonary disease||0.093||chronic||0.77||0.54||0.77|
|Disorders of lipid metabolism||0.054||chronic||0.72||0.54||0.71|
|Coronary atherosclerosis and related||0.041||chronic||0.79||0.56||0.79|
|Diabetes mellitus without complication||0.006||chronic||0.61||0.52||0.51|
|Macro-average (chronic diseases)||-||-||0.72||-||-|
|Congestive heart failure; non hypertensive||0.105||mixed||0.79||0.62||0.78|
|Diabetes mellitus with complications||0.046||mixed||0.89||0.75||0.85|
|Other liver diseases||0.038||mixed||0.80||0.55||0.78|
|Macro-average (mixed diseases)||-||-||0.80||-||-|
|Micro-average (all diseases)||-||-||0.86||0.56||0.84|
|Macro-average (all diseases)||-||-||0.80||0.57||0.78|
3.3.4 Decompensation prediction
As mentioned in Section 3.1.4, decompensation is related to mortality prediction, with the difference that we predict whether the patient survives the next 24 hours, given the current time step. As such, time-dependence is critical. Since 3 of the 7 categorical variables are time-independent and only 4 are time-dependent, predicting decompensation from the categorical variables alone is a difficult challenge for the model. For this reason, the model with only numerical variables outperforms the other approaches, as shown in Table 7.
[Table 7: decompensation prediction results in the ICU]
4 Related work
A number of scoring systems have been developed for mortality prediction, including the Acute Physiology and Chronic Health Evaluation (APACHE III (knaus1991apache), APACHE IV (zimmerman2006acute)) and the Simplified Acute Physiology Score (le1993new) (SAPS II, SAPS III). Most of these scoring systems use logistic regression to identify the predictive features on which they are built.
Providing an accurate prediction of mortality risk for patients admitted to the ICU using the first 24 or 48 hours of ICU data could make decision making easier and reduce healthcare costs. In this regard, recent advanced techniques in artificial intelligence have been shown to outperform conventional machine learning and clinical prediction techniques such as APACHE and SAPS (harutyunyan2019multitask) (Purushotham2018a) (lipton2015learning). Mortality prediction has been a popular application for deep learning researchers in recent years, though model architecture and problem definition vary widely. Darabi et al. (DARABI2018306) used a convolutional neural network and a gradient boosted tree algorithm to predict long-term mortality risk (30 days) on a subset of the MIMIC-III dataset. Similarly, Celi et al. (celi2012database) developed mortality prediction models based on a subset of the MIMIC database using logistic regression, a Bayesian network and an artificial neural network.
Harutyunyan et al. (harutyunyan2019multitask) developed a deep learning model based on an LSTM RNN, called multi-task RNN, to predict in-hospital mortality, decompensation, phenotyping and length of stay in the ICU. The proposed model was applied to the MIMIC-III dataset. Similarly, Purushotham et al. (Purushotham2018a) carried out a comprehensive benchmark of several machine learning and deep learning models trained on MIMIC-III for various tasks, with results showing that deep models consistently outperform conventional machine learning models and scoring systems.
Previous work (Purushotham2018a) (harutyunyan2019multitask) has shown that deep learning models obtain good results on forecasting length of stay in the ICU. In this regard, Tu et al. (TU1993220) applied neural network based methods to a private Canadian dataset of patients who underwent cardiac surgery. The proposed model was able to identify patients with low, intermediate and high risk of prolonged stay in the ICU. Deep learning methods have been applied to phenotyping by Razavian et al. (razavian2016multi) and Lipton et al. (lipton2015learning). While the former trained an LSTM and a CNN to predict 133 diseases from 18 laboratory tests on a private dataset of 298k patients, the latter applied an LSTM RNN to a private pediatric intensive care unit (PICU) dataset in order to classify 128 diagnoses given 13 clinical measurements.
In terms of physiologic decompensation, Xu et al. (xu2018raim) proposed an attention-based model that outperformed several machine learning models in predicting the decompensation event. They evaluated their proposed model on the MIMIC-III Waveform Database Matched Subset. Similarly, Harutyunyan et al. (harutyunyan2019multitask) proposed decompensation prediction as one of the tasks in their multi-task benchmark.
5 Conclusion
In this study we have described four standardised benchmarks in critical care research. Our definition of the benchmark tasks is consistent with previously published benchmarks. However, we focus on the more recent eICU database, in which clinical data has been collected from multiple ICU centres across the United States, which may result in lower systematic bias. We provide a set of baselines for our benchmarks and show that a bidirectional LSTM significantly outperforms linear models, especially in tasks with temporal dependencies, such as length of stay. Of note is the impact of entity embedding of categorical variables in further improving the performance of our LSTM-based model. We believe that our work will provide a solid basis for further improving critical care decision making, and we provide the source code for other researchers who wish to replicate our experiments and build upon our results.
We gratefully acknowledge clinical input provided by Monica Moz, MD in both cohort selection as well as variable ranking and selection.