Log In Sign Up

HiRID-ICU-Benchmark – A Comprehensive Machine Learning Benchmark on High-resolution ICU Data

by   Hugo Yèche, et al.
ETH Zurich

The recent success of machine learning methods applied to time series collected from Intensive Care Units (ICU) exposes the lack of standardized machine learning benchmarks for developing and comparing such methods. While raw datasets, such as MIMIC-IV or eICU, can be freely accessed on Physionet, the choice of tasks and pre-processing is often chosen ad-hoc for each publication, limiting comparability across publications. In this work, we aim to improve this situation by providing a benchmark covering a large spectrum of ICU-related tasks. Using the HiRID dataset, we define multiple clinically relevant tasks developed in collaboration with clinicians. In addition, we provide a reproducible end-to-end pipeline to construct both data and labels. Finally, we provide an in-depth analysis of current state-of-the-art sequence modeling methods, highlighting some limitations of deep learning approaches for this type of data. With this benchmark, we hope to give the research community the possibility of a fair comparison of their work.


An Extensive Data Processing Pipeline for MIMIC-IV

An increasing amount of research is being devoted to applying machine le...

MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III

Robust machine learning relies on access to data that can be used with s...

AnomalyBench: An Open Benchmark for Explainable Anomaly Detection

Access to high-quality data repositories and benchmarks have been instru...

Evaluating Progress on Machine Learning for Longitudinal Electronic Healthcare Data

The Large Scale Visual Recognition Challenge based on the well-known Ima...

Yes We Care! – Certification for Machine Learning Methods through the Care Label Framework

Machine learning applications have become ubiquitous. Their applications...

1 Introduction

Severely ill patients require treatment and surveillance in Intensive Care Units (ICU). Critical health conditions are characterized by the presence or risk of developing life-threatening organ dysfunction. During a patient’s stay in the ICU continuous monitoring of organs function parameters, enables early recognition of physiological deterioration and rapid commencement of appropriate interventions. Recent research shows the great success of machine learning methods when applied to ICU time series ncle ; horn2020set . One of the main goals of previous work was to develop new methods for prediction tasks relevant for clinical decision-making. Exemplary of such tasks are alarm systems that predict different types of organ failure hyland2020early ; tomavsev2019clinically .

To develop and evaluate such methods only a small number of large-scale datasets are freely-accessible: The MIMIC-III johnson2016mimic and IV MIMIC-IV datasets, AmsterdamUMCdb ams , HiRID (hirid, ) and the eICU Collaborative Research Database pollard2018eicu . However, these datasets are not provided in a pre-processed form directly suitable for machine learning nor do they have well-defined tasks, making it impossible to fairly compare works johnson2017reproducibility . While some pre-processed alternatives with well-defined tasks exist (goldberger2000physiobank, ; reyna2019early, ), they are often lacking in terms of size and diversity of tasks. We provide more detail about this in section 2. This leads to situations where works compare methods on their private data (tomavsev2019clinically, ) or only on limited data and number of tasks. Also the lack of relevant clinical sub-tasks for benchmarking stops the development of new methods for clinical decision support systems harutyunyan2019multitask . Finally, as in other fields, with recent datasets such as HiRID (hirid, ) the time resolution of data has greatly increased. However, no benchmark on ICU time series using such high-resolution datasets currently exists.

To improve this situation, in this paper we provide an in-depth benchmark based on the HiRID dataset (hirid, ; hyland2020early, )111, which was released on Physionet hirid alongside the publication on the circulatory Early Warning Score (circEWS) hyland2020early . HiRID is a freely accessible critical care dataset containing data recorded at the Department of Intensive Care Medicine, Bern University Hospital, Switzerland (ICU). The dataset was developed in cooperation with the Swiss Federal Institute of Technology (ETH), Zürich, Switzerland. We define a new benchmark on HiRID composed of various clinically relevant tasks and provide a comprehensive pipeline, which includes all steps from preprocessing to model evaluation. To assess the different aspects of the benchmarked machine learning methods, we diversify the tasks around specific challenges of ICU data such as prediction frequency, class imbalance, or organ dependency of the task. To profit from data acquisition advances and allow improvement on longer time series, we use a resampled data resolution of 5 min. HiRID has a higher time resolution than any other published critical care dataset and it motivates us to provide a comprehensive benchmark suite on this data-set. Also, we believe that this dataset will motivate the construction of new predictive methods for the healthcare field, going beyond ICU time series.

The main contributions of this paper are:

  • We developed a comprehensive, end-to-end pipeline for time-series analysis of critical care data based on the recently published HiRID dataset. This pipeline includes the following stages: data preprocessing mode, training mode, and evaluation mode.

  • We proposed and implemented a variety of tasks relevant to healthcare workers in the ICU, diversified in terms of type, prediction resolution, and label prevalence. The tasks cover all major organ systems as well as the general patient state. We included regression and classification (binary and multi-class) tasks.

  • By providing a comprehensive benchmark on a set of canonical tasks, we give the research community around predictive modeling on ICU time series the possibility for the clear comparison of their methods.

The paper is organized as follows: in Section 2 we provide an overview of existing ICU datasets and benchmarking papers. We provide details about the HiRID dataset and introduce the tasks defined in collaboration with clinicians in Section 3 and give more details on the tasks in Appendix A: Dataset Details. Section 4 describes the pipeline design, with more details given in Appendix B: HiRID-ICU-Pipeline Details. Section 5 describes the experiment and ablation study. In Section 6 we discuss the observed results and relate this paper to other benchmarks and related tasks relevant for clinicians.

2 Related Work

The main goal of this work is to provide a benchmark on the HiRID dataset for various clinical prediction tasks of interest. We describe here other ICU datasets as well as existing benchmarks for ICU data.

ICU time-series datasets

There are several widely-used, freely-accessible datasets consisting of ICU time series. MIMIC-IV johnson2016mimic is the oldest and most widely used ICU dataset available. It consists of physiological measurements as well as information about laboratory analyses. Physiological measurements are recorded with a maximum resolution of 1 hour. The results of laboratory analysis are collected at irregular time intervals. Moreover, there are static features like gender, age, diagnosis, etc. available. The dataset consists of information recorded about 40,000 ICU stays at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA, USA. The median of the patient stay length is 2 days. The eICU Collaborative Research Database pollard2018eicu is a large multicenter critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology. It contains data associated with over 200,000 patient stays, but the public version does not reach the granularity of other datasets in terms of time resolution and data elements. The first version of AmsterdamUMCdb ams was released in November 2019. Its current version from March 2020 contains data related to 23,172 ICU and high dependency unit admissions of adult patients from 2003 - 2016 from Amsterdam University Medical Centers. The data includes clinical observations like vital signs, clinical scores, device data, and lab results.

Benchmarks on ICU time-series.

Among works using the openly available datasets mentioned above, to the best of our knowledge, only a single standardized benchmark exists, MIMIC-III benchmark by Harutyunyan et al. harutyunyan2019multitask . In that work four tasks were proposed, two requiring one prediction per patient stay and two dynamic tasks with one prediction per hour each. In addition, while they do not propose a benchmark, Jarrett et al. jarrettclairvoyance developed a standardized pipeline for medical time series, called Clairvoyance. Results were shown on several datasets, including MIMIC. In this spirit, some packages address a specific family of tasks, for example, classification faouzi2020pyts and forecasting guecioueurpysf . Finally, some public challenges, with curated data, were proposed in the past, e.g. for the early prediction of sepsis (Physionet 2019 challenge reyna2019early ) or mortality prediction (Physionet 2012 challenge citi2012physionet ). However, the provided datasets are smaller than HiRID and built around a single task.

3 Benchmark Design

3.1 The HiRID Dataset

HiRID (hirid, ; hyland2020early, ) is a freely accessible critical care dataset containing data from more than 33,000 patient admissions to the Department of Intensive Care Medicine, Bern University Hospital, Switzerland (ICU) from January 2008 to June 2016. It was released on Physionet hirid alongside the publication of the circulatory Early Warning Score (circEWS) hyland2020early . The dataset was developed in cooperation with the Swiss Federal Institute of Technology (ETH) Zürich, Switzerland. It contains de-identified demographic information and a total of 712 routinely collected physiological variables, diagnostic test results, and treatment parameters. HiRID has a higher time resolution than any other published ICU dataset, particularly for bedside monitoring, with most parameters recorded every 2 minutes, which motivates us to provide a comprehensive benchmark suite on this dataset. Demographic information about the patient cohort are displayed in Appendix Table 1.

Task name Task type Task description
ICU mortality Binary classification, one prediction per stay Predicted at 24h after admission to the ICU.
[0.3pt/1pt] Patient phenotyping Multi-class classification, one prediction per stay Classifying the patient after 24h regarding the admission diagnosis, using the APACHE group II and IV labels222APACHE II and IV Zimmerman2006-of ; Knaus1985-iw are subsequent versions of the major illness severity score used in the ICU. They also introduce a patient grouping according to admission reason. We use an aggregate of these two groupings for this task (see Appendix A: Dataset Details)
Circulatory failure 333Circulatory failure is defined as Lactate mmol/l and either mean arterial blood pressure mmHg or administration of any vasoactive drug. Binary classification, dynamic prediction throughout stay Continuous prediction of onset of circulatory failure in the next 12h, given the patient is not in failure now.
[0.3pt/1pt] Respiratory failure444Respiratory failure is defined according to the Berlin definition (ARDS_Definition_Task_Force2012-hl, ) as a P/F ratio mmHg. Binary classification, dynamic prediction throughout stay Continuous prediction of onset of respiratory failure in the next 12h, given the patient is not in failure now.
Kidney function Regression, dynamic prediction throughout stay Continuous prediction of urine production in the next 2h as an average rate in ml/kg/h. The task is predicted at irregular intervals.
[0.3pt/1pt] Remaining length of stay Regression, dynamic prediction throughout stay Continuous prediction of the remaining ICU stay duration.
Table 1: Definition of prediction tasks contained in the HiRID-ICU benchmark suite

3.2 Prediction Tasks

Our benchmark suite focuses on clinically relevant prediction tasks with a large diversity in the machine learning task types. The tasks cover most major organ systems as well as the general patient state. The major organ systems include the cardiovascular, kidney, and respiratory systems. For each organ system, we provide a prediction task related to the main organ function. Length of stay, mortality, and patient phenotyping are chosen to represent tasks regarding an overall patient state. From a machine learning point of view, our suite contains regression and classification (binary and multi-class) tasks. We included tasks with different degrees of class imbalance to diversify the spectrum further and enable the comparison of methods on e.g. highly imbalanced tasks. We chose tasks performed online throughout the stay (every 5 minutes) and at fixed times of the stay, such as 24h after ICU admission, which capture a more long-term state of the patient. To enhance reproducibility, we include some tasks previously considered in harutyunyan2019multitask , mortality, and remaining length-of-stay prediction. Table 1 contains the full task descriptions.

4 Pipeline Design

Figure 1 shows an overview of the major HiRID-ICU pipeline steps. The pipeline is designed using the preprocess-train-predict paradigm. We provide more details about it in Appendix B: HiRID-ICU Pipeline Details and the README section of the software repository555 The preprocessed data contains two data versions, common_stage and ml_stage

. The first is independent of modeling choices and serves as the starting point for future works with custom pre-processing choices. The latter is a compatible version for our pipeline with our categorical encoding, imputation, and scaling choices.

Figure 1: Detailed Pipeline. (Red) Raw Data. (Blue) Common pre-processing. (Green) Wide-format data. (Grey) Modeling depending stages.

4.1 Data Pre-processing

In its public version, HiRID, as any real-world dataset, contains certain artifacts that require pre-processing. As pointed out by (bellamy2020evaluating, ) for MIMIC-III, individual pre-processing in each work avoids a fair comparison of them. To this effect, we aim to provide a modular and reproducible pipeline. Patient EHRs in HiRID are stored in a long table format where each row of the table is a record containing the measurement value of a specific variable at a specific time for a patient, which cannot be used as a ready input for machine learning models.

Wide-format Merging

To obtain a more compact format, the first pre-processing step in our pipeline is to transform the long table of patient EHRs into feature matrices, where each column represents a clinical concept, which we call wide format. Such a data format represents an irregularly sampled multivariate time series. At this step, we also remove any physiologically impossible measurement.

High-Resolution Gridding

After this merging step, we further compact the dataset by re-sampling it to a 5 minute resolution. Thus, each time step contains the last value measured in the last 5 min if it exists, or is empty otherwise. This gridding is similar to the strategy used by (hyland2020early, ). We refer to the output of this step as the common_stage in Fig.1. Because it is independent of modeling choices, this stage provides a starting point for future approaches using different imputation and scaling choices.

Processing for Machine Learning

In the second part of the pipeline, we process the common stage of the data to be compatible with ML models’ expected input format. For this, we first use forward-filling imputation for each stay. Then, we apply one-hot encoding for categorical variables and scale the remaining ordinal or continuous variables. We standard scaled all variables with the exception of the time since admission and admission date, which we min-max scaled. By doing scaling globally, we ensure to preserve patients’ specificity (e.g.: tachycardia). We refer to the output of this stage as the

ml_stage as it is dependent on our modeling choices.

4.2 Hand-engineered Feature Extraction

In the original paper describing the HiRID dataset (hyland2020early, ), the authors showed that boosted tree ensembles such as LGBM (ke2017lightgbm, ), when provided with hand-engineered features, outperform state-of-the-art deep learning methods. Based on this observation, we include in our pipeline the possibility to extract such features from the common_stage of the data. For our models, we extracted four features for each non-categorical variable over the entire history: minimum past value, maximum past value, mean past value and density of measurement, which is the proportion of time points where a value is provided among all possible time points in the history666This is done on the regularly sampled version of the data. These features are then included in the ml_stage.

4.3 Label Construction and Splitting

We construct prediction task labels using the provided measurements and meta-data for both continuous and stay-level tasks. As an intermediate step for label construction, we use a forward imputed version of the data, as in the modeling stage. Concerning the experimental design, we use a random split by patients. The training set contains 70% of the patients and validation and test sets, 15% each. A temporal splitting strategy as used by Hyland et al. (hyland2020early, ) would be more clinically relevant but information about admission time was removed to preserve anonymity when the dataset was originally published. While longer stays exist in the dataset, for computational reasons, we limited labeling to the first 7 days of stays (2016 steps of 5 min). This cropping affects less than 6% of all stays.

Task name Train set Validation set Test set
Circ. failure 1.4 % (n=14.12M) 1.3 % (n=3.01M) 1.4 % (n=2.96M)
Resp. failure 8.7 % (n=5.58M) 8.4 % (n=1.21M) 8.4 % (n=1.20M)
Mortality 8.7 % (n=10525) 7.1 % (n=2206) 8.3 % (n=2231)
Phenotyping 0.2 % (n=10470) 0.1 % (n=2194) 0.1 % (n=2217)
Kidney function 1.17 ml/kg/h (n=341424) 1.12 ml/kg/h (n=71549) 1.18 ml/kg/h (n=70642)
Rem. LOS 41.04h (n=15.15M) 41.51h (n=3.22M) 39.64h (n=3.17M)
Table 2: Label statistics for each of the tasks, in the training, validation and test sets. As a metric, for binary classification tasks, the positive label prevalence is reported. For multi-class classification tasks, the class prevalence of the minority class is reported. Finally, for regression tasks the median of the label distribution is reported. In parentheses the number of samples is reported. M: Million.

4.4 Model Training and Evaluation

The final part of the pipeline contains an end-to-end machine learning suite to train and evaluate our models depicted on the right hand side of Fig.1. ML approaches were implemented using scikit-learnscikit-learn and lightgbmke2017lightgbm , whereas DL approaches were implemented in pytorch paszke2019pytorch . All DL models were trained using Adam optimizer kingma2014adam , with a cross-entropy objective for classification tasks and mean-squared error (MSE) for regression tasks. For classification we provide the possibility to balance loss weights according to class prevalence as in king2001logistic .

For the evaluation of models, we use a range of metrics relevant to each task. For classification tasks, we considered AUROC777 and AUPRC888 metrics in the binary case, and balanced accuracy (B-Accuracy) brodersen2010balanced in the multi-class one. For regression tasks, we used mean absolute error (MAE) as a comparison metric. Regardless of the task or model, we used the scikit-learn implementation for all metrics. More details about this stage of the pipeline can be found in Appendix B: HiRID-ICU-Pipeline Details.

5 Experiments

5.1 Settings

For all models, we tuned specific hyper-parameters using random search. Each randomly picked set of parameters was run with 3 different random initializations. We then selected hyper-parameters on the validation set performance for either AUPRC, B-Accuracy, or MAE. All models were trained with early stopping on the validation loss. Further details about hyper-parameters can be found in Appendix B: HiRID-ICU-Pipeline Details.

Because of the class imbalance existing in classification tasks, we considered balanced loss weights for all methods. However as further discussed in subsection 5.5, this technique was relevant only for the Patient Phenotyping task. For regression tasks, we min-max scaled the labels at training time to avoid exploding gradients.

5.2 Benchmarked Methods

In our proposed benchmark, we considered two groups of machine learning algorithms. The first group consists of regular machine learning algorithms, which as shown are highly effective for ICU-related tasks (sadeghi2018early, ; harutyunyan2019multitask, ; hyland2020early, )

. It is composed of a Gradient Boosting method with LightGBM

(ke2017lightgbm, )

and Logistic Regression. The second group is focused on deep learning methods. We select the most commonly used sequence models for this group: Recurrent neural networks (LSTM 

(hochreiter1997long, ) and GRU (cho2014learning, )

), convolutional neural networks (CNN), in particular, temporal convolutional networks (TCN) 

bai2018empirical and Transformer models vaswani2017attention .

5.3 Benchmarking Models on High-resolution ICU Data

In this section, we compare the earlier described methods on all tasks. While DL approaches are provided with the entire history for all time points, ML methods use only the values of the current step as an input. Thus one would expect the latter models to perform significantly worse due to the lower amount of information provided.

Task ICU Mortality Patient Phenotyping
Metric AUPRC () AUROC () B-Accuracy ()
LR 58.1 0.0 89.0 0.0 39.1 0.0
LGBM 54.6 0.8 88.8 0.2 40.4 0.8
LGBM w. Feat. 62.6 0.0 90.5 0.0 45.8 2.0
GRU 60.3 1.6 90.0 0.4 39.2 2.1
LSTM 60.0 0.9 90.3 0.2 39.5 1.2
TCN 60.2 1.1 89.7 0.4 41.6 2.3
Transformer 61.0 0.8 90.8 0.2 42.7 1.4
Table 3: Benchmark of methods for stay level tasks.(Top rows) ML methods; (Bottom rows) DL methods. All scores are averaged over 10 runs with different random seeds such that the reported score is of the form

. In bold are the methods within one standard deviation of the best one. Classification metrics were scaled to 100 for readability purposes.

Task Circulatory failure Respiratory failure Kidney func. Remaining LOS
Metric AUPRC () AUROC () AUPRC () AUROC () MAE () MAE ()
LR 35.6 0.0 96.9 0.0 49.3 0.0 91.3 0.0 N.A N.A
LGBM 43.5 0.3 97.7 0.0 54.2 0.3 92.8 0.0 0.45 0.00 56.9 0.4
LGBM w. Feat. 43.9 0.6 97.7 0.0 55.7 0.1 93.1 0.0 0.45 0.00 57.0 0.3
GRU 41.9 0.3 97.5 0.0 52.8 0.5 92.6 0.1 0.49 0.02 60.6 0.9
LSTM 36.4 1.3 96.6 0.2 51.6 0.8 92.3 0.1 0.50 0.01 60.7 1.6
TCN 41.5 0.5 97.5 0.0 53.4 0.4 92.7 0.1 0.50 0.01 59.8 2.8
Transformer 42.1 0.6 97.6 0.0 53.7 0.4 92.7 0.1 0.48 0.02 59.5 2.8
Table 4: Benchmark of methods for online monitoring tasks. (Top rows) ML methods; (Bottom rows) DL methods. All scores are averaged over 10 runs with different random seeds such that the reported score is of the form . In bold are the methods within one standard deviation of the best one. Classification metrics were scaled to 100 for readability purposes. MAE is in units ml/kg/h for Kidney Function and in hours for Remaining LOS.
Stay-Level Tasks

When comparing methods on tasks requiring a single prediction after 24h (Table 3), we observe the superiority of LGBM with hand-extracted features. Transformers outperformed other DL methods but we observe a significant performance gap with the best ML method in B-Accuracy for Patient Phenotyping and AUPRC for ICU Mortality. Concerning GRU and LSTM, their performance is similar to TCN’s for ICU Mortality. However, on the Patient Phenotyping task, they do not manage to outperform even logistic regression.

Online Failure Predictions

For the continuous classification tasks, where the maximum sequence length extends from 288 steps to 2016, DL methods do not leverage the additional history information. Indeed, as shown in Table 4, for both Circulatory and Respiratory Failure, LGBM trained only on the current variables outperforms all DL methods. Among these methods, LSTM is the most impacted, as it has noticeably lower scores. Finally, for all continuous tasks, including regression discussed below, the improvement brought by hand-extracted features is not as significant. It suggests that statistical features, when extracted from the entire history, are less informative.

Online Regression Tasks

The final set of tasks we benchmark are regression tasks (Table 4). As for the classification case, LGBM-based methods outperform DL methods, which, among them, have similar performance. In addition, we do not observe any improvement brought by our selection of hand-extracted features. Moreover, the overall performances of the proposed methods are relatively low. While a MAE of ml/kg/h for Kidney Function is only two times smaller than the median urine output rate, a h error in Remaining LOS is more than two times the median length-of-stay. We believe these low scores are due to the nature of the labels’ distributions, which are both heavy-tailed as shown in Appendix A: Dataset Details.

5.4 Behaviour of Deep Learning Approaches for Long Time-Series

One notable difference between the MIMIC-III benchmark harutyunyan2019multitask and our work is the data resolution. The resolution of our data being twelve-time higher leads to 2016 steps (1 week) sequences for online tasks. Thus, we explore if the increase of sequence length explains DL methods’ decrease in performances for continuous tasks.

History Length

One way to verify if DL methods leverage long-term dependencies in their prediction is to check if a decrease in the considered history impacts performance. Among the DL methods, we can verify this for Transformers and TCN by using local attention in the first case and reducing the number of dilated convolutions in the other. In the results (Figure 2), we observe that these two models do not use the additional information provided by early steps. With only a 1h history, Transformer performance stays the same. TCN performance, on the other hand, does drop, but only for a history smaller than 6h. It could explain why LGBM, which does not have access to these extra time steps, manages to outperform both methods compared to stay-level tasks.

Figure 2: Impact of history length on online classification performance. (Left) Comparison in AUPRC for the Circulatory Failure task ; (Right) Comparison in AUPRC for the Respiratory Failure task. Error bars represent the standard deviation over 5 runs with different random initializations.
Data Resolution

Another approach to decrease the length of sequences is to reduce the data resolution. We compare all DL methods with a 1h prediction interval to assess data resolution impact on metrics. This way, we can gradually decrease the data resolution from 5min to 1h while preserving the same prediction time-steps. We report the result of this experiment in Figure 3. We observe that while TCN and Transformer performance are almost identical, GRU and LSTM are both impacted in opposite ways. GRU is noticeably better than LSTM on both tasks with a 5min grid, but as resolution lowers to 1h, methods’ performances become similar.

Figure 3: Impact of data resolution on online classification performance. (Left) Comparison in AUPRC for the Circulatory Failure task ; (Right) Comparison in AUPRC for the Respiratory Failure task. Error bars represent the standard deviation over 5 runs with different random initializations.

5.5 On Weighting Cross-entropy by Class Prevalence

All the tasks we define show a high degree of imbalance and the class imbalance problem (CIP) is known to be highly challenging johnson2019survey . The most common approach to this problem is the use of class weights in the loss objective. For this ablation, we adopt the original idea from (king2001logistic, ) by defining weights inversely proportional to class prevalence. However, as shown in Table 5, such a weighting choice decreases performance in all binary classification tasks.

Task ICU Mortality Phenotyping Circulatory Failure Respiratory Failure
Metric AUPRC () AUROC () B-Accuracy () AUPRC () AUROC () AUPRC () AUROC ()
LGBM w. Feat. -1.3 0.0 +4.3 -0.8 -0.2 -4.4 -1.1
Transformer -2.6 0.0 +4.0 -0.6 0.0 -0.5 0.0
Table 5: Deltas in metrics of using balanced cross-entropy loss. (Green) Improvements over using no weights; (Red) Deterioration over using no weights.

6 Discussion

In this paper, we provided an in-depth benchmark on the HiRID dataset and evaluated the behaviour of machine learning models on various clinically relevant tasks developed in collaboration with intensive care clinicians. Our primary contribution is a full and reproducible preprocessing and machine learning pipeline and benchmark tasks on a public intensive care dataset, a necessary prerequisite for reproducible and comparable research in the future. We further evaluate current state-of-the-art machine and deep learning algorithms on these tasks establishing a baseline to compare future methods against. We consider this our second major contribution.

This work confirms previous results (hyland2020early, )

, that conventional machine learning models (i.e. boosted ensembles of decision trees) outperform current deep learning approaches on medical time series problems. Based on the experimental results we found that deep learning models do not lead to the same breakthrough performance increases as in other domains (such as NLP 


or Computer Vision 

dosovitskiy2020image ). We believe the sparsity of the data and the imbalance of labels in both regression and classification tasks play an important role in this. For classification tasks, building a specific objective for highly imbalanced tasks such as Focal loss lin2017focal might be a potential area of research. For regression, recent work has shown some promising leads for heavy-tailed regression tasks yang2021delving . Moreover, HiRID introduces a novel high-resolution aspect in ICU data, that needs to be correctly taken into account. Thus, as for other sequence data, one possible explanation could be that when trained with extremely long sequences, models can not use the extracted features in the most effective way zaheer2020big . In the case of Transformers, to force the model to learn and extract useful patterns, various kinds of improvements could be made tay2020efficient . In particular, learnable patterns could be incorporated roy2021efficient .

Our work goes beyond previous ICU time-series benchmarks (e.g. (harutyunyan2019multitask, )) by using a more diverse set of tasks and a data set with a higher time resolution. As discussed earlier the set of clinical prediction tasks is diverse regarding the assessed organ systems, prevalence, and task type. An important limitation of our study is that HiRID is currently not the most frequently used and hence known ICU data set.

This work facilitates the future development of machine learning methods and standardized comparison of their performance on a diverse set of predefined tasks. It could contribute to solving today’s problem of machine learning on medical time series not being comparable due to each work’s unique datasets, preprocessing, and tasks definition. We hypothesize that methods developed and successfully evaluated on these tasks can also be successfully transferred to other specific medical time series problems.

This work also fills the gap between proposed machine learning approaches and their applications to ICU tasks. As a concrete example, COVID-19 is a big challenge for ICU patient monitoring. Important issues in this context are the uncertainty of the patient’s prognosis as well as the prediction of the disease progression. COVID-19 is known to cause respiratory failure li2020neuroinvasive , one of the tasks studied in our benchmark, which is also the main cause for ICU admission and death li2020acute ; ruan2020clinical ; richardson2020presenting ; holter2020systemic . During the current COVID-19 pandemic, e.g. first attempts to construct a Respiratory Failure prediction model were already done bolourani2021machine , however, the data is available only for a limited audience, limiting reproducibility.

7 Conclusion

In this paper, we proposed an in-depth benchmark on time series collected from an Intensive Care Unit (ICU). In collaboration with clinicians, we defined several tasks relevant for healthcare covering different critical aspects of ICU patient monitoring. We provide a reproducible end-to-end pipeline to derive both data and labels, and a training setup to evaluate the final performance. We hope that this benchmark facilitates the construction and evaluation of machine learning methods for ICU data, and encourages reproducible research in this field.


Dataset description

The HiRID ICU data is provided by the University Hospital Bern, Bern, Switzerland. The dataset consists of 33,905 patients whose length of stays in the ICU range from 0 to 28 days. A more detailed summary of the HiRID cohort statistics can be found in Table 6.

For all patients, a variety of clinical measurements are collected. It represents a set of 710 variables that can be categorized into the following types:

  • Demographics (e.g.: Sex, Age, Height)

  • Bedside vital signs (e.g.: Heart rate)

  • Settings of medical devices (e.g.: Mechanical ventilation)

  • Manual observations (e.g.: Urine output)

  • Lab measurements (e.g.: Lactate)

  • Treatments (e.g.: Vasopressor agents)

One notable difference between these types of measurements is the resolution at which they are provided. Bedside vital signs are provided at regular intervals of 2 min, whereas lab measurements are only available every couple of hours at best. The frequency of recording discrepancy existing between different variables yields unique challenges for machine learning models. Further details about the available clinical measurements can be found in the official documentation of HiRID999URL for the official HiRID documentation as well as in our software repository101010 Table containing all information for each variable:

Sex Female 12,138 patients
Male 21,767 patients
Age group [20,30) 1,042 patients
[30,40) 1,278 patients
[40,50) 2,649 patients
[50,60) 5,194 patients
[60,70) 8,241 patients
[70,80) 9,445 patients
[80,90) 5,534 patients
[90,100) 522 patients
Discharge status Alive 31,604 patients
Dead 2,062 patients
Unknown 239 patients
APACHE group (Patient phenotype) Cardiovascular Surgical 8,125 patients
Non-surgical 4,356 patients
Neurologic Surgical 3,938 patients
Non-surgical 6,014 patients
Gastrointestinal Surgical 1,768 patients
Non-surgical 1,918 patients
Respiratory Surgical 631 patients
Non-surgical 2,399 patients
Trauma Surgical 239 patients
Non-surgical 1,522 patients
Other medical diseases 1,428 patients
Other surgical 568 patients
Metabolic/Endocrinology 630 patients
Hematologic 99 patients
Renal surgical 81 patients
Unknown 189 patients
Length of stay Median 0.95 days
Range (0,28] days
Table 6: Cohort statistics of the public HiRID dataset v1.1.1, as released on Physionet, which was used for the HiRID-ICU benchmark

Data preprocessing

Artifact removal

The HiRID data when directly exported from the data management system contains various types of artifacts. The most common one concerns measurement values that are out of the normal range defined by clinicians. Another artifact that we observed is that the time of the first measurement record of cumulative variables is at noon by default, which could be much earlier than the admission time of the patient. This, if not taken into account correctly, will affect the rate calculation for those cumulative variables. To solve these timestamp issues, we clip the time of the first record of the cumulative variables to the ICU admission time of the patient.

Variable merging

There are often various variables in the ICU EHRs that have the same or similar clinical meaning. As an example, there exist three variables for temperature, namely core body temperature, auxiliary temperature, and rectal temperature. Usually, only one or few variables from one medical concept are measured for a patient, therefore, merging variables with the same clinical concept into one meta-variable reduces the sparsity of the feature matrices. Another advantage is that machine learning models trained on medical concept features are more transferable to other hospitals than those trained on the original clinical variables because hospitals usually have their own variable coding scheme. We hence incorporated variable-merging in our first pre-processing step. A challenge that arises from merging pharmaceutical variables is the dosage normalization across drugs with similar active ingredients. Therefore, instead of using the drug dosage as features, which likely contain invalid information, we use “presence/absence” binary indicators of drugs as features for the machine learning models.

Definition of pharmaceutical acting periods

When a patient has been administrated a drug at a time point, only this time point contains a measurement. However, each drug is active for a specific duration after this administration time. For this reason, we ensure to propagate the measurement for a duration corresponding to the drug acting period. We set the value back to zero after this period ensuring compatibility with forward filling imputation.

Conversion of cumulative values to rates

There are a few fluid output variables whose measurement values are recorded as cumulative every 12 hours starting from noon on the ICU admission day. Since the cumulative signals are periodic, we converted them to hourly fluid rates which are more interpretable and amenable as machine learning features and for the endpoint definition.

Endpoint definition and statistics

Endpoint extraction algorithms

Our benchmark suite contains a range of clinically relevant tasks chosen to range over a spectrum of different properties of tasks, e.g. task type from the machine learning point-of-view (regression, binary classification, multi-class classification), point of the ICU stay at which the prediction is made (fixed time-point vs. predictions made throughout the stay), as well as the degree of class imbalance (highly imbalanced tasks vs. more balanced tasks). While in the main paper we describe the main idea of each task, the full definition and implementation details of the extraction algorithm are described below.

Dependencies on previous pipeline stages.

Static information about the patient is extracted from the static.parquet file, in which the mortality status and the APACHE-II/IV codes for the admission are stored. This file is derived from the general_table.parquet and observations tables, where APACHE II/IV codes are stored, during the data preprocessing pipeline stage. Respiratory/kidney/circulatory failure labels are generated from the merged stage of the data, via an intermediate imputation step, in which measurements are forward filled to statistics about the number of real measurements in time grid intervals were pre-computed. These simplified the implementation of the endpoints, whose extraction algorithms use a combination of imputed values and real measurement detection strategies.

Label masking on presence of vital sign monitoring.

All tasks described below were additionally masked with a condition on a current connection to vital sign monitors. This implied setting a valid label (according to the below task definitions) to invalid (NaN) if the patient had no heart rate measurement (vm1) in the 15 minutes surrounding the time grid point. Since the expected frequency of heart rate measurements is 1 sample / 2 minutes, it is likely the patient is disconnected when this condition is satisfied, and no reliable predictions on their state can be made.

Patient phenotyping.

To derive the APACHE diagnostic group of the patient, the two fields APACHE-II group and APACHE-IV group of the static HiRID data are used. To map these two fields to a standardized encoding, the conversion Table 7 is used. If the patient had a valid APACHE-II group entry and its code was mapped in the table, the target APACHE group code was used. Otherwise, we tried to fall back to the APACHE-IV group entry, where again if it was present and mapped, the corresponding target APACHE group code was used. If still no code could be extracted, the prediction task was defined as invalid for the patient. The label of the time-point 24h after ICU admission was set to the class category of the APACHE group, defining a multi-class classification problem to be solved once in the stay. If the admission was shorter than 24h, no label for the patient was assigned.

APACHE II Code APACHE IV Code Original group name Target group Target group name
98 190 Cardiovascular 1 Cardiovascular
99 191 Respiratory 2 Respiratory
100 192 Gastrointestinal 3 Gastrointestinal
101 193 Neurologic 4 Neurologic
197 Urogenital 6 Other medical diseases
102 Sepsis 6 Other medical diseases
106 198 Other 6 Other medical diseases
206 Intoxication 6 Other medical diseases
103 194 Trauma not surgical 7 Trauma not surgical
104 195 Metabolic/Endocrinology 8 Metabolic/Endocrinology
105 196 Hematologic 9 Hematologic
107 199 (Cardio)vascular surgical 11 (Cardio)vascular surgical
108 201 Respiratory surgical 12 Respiratory surgical
109 200 Gastrointestinal surgical 13 Gastrointestinal surgical
110 202 Neurologic surgical 14 Neurologic surgical
111 203 Trauma surgical 15 Trauma surgical
112 204 Renal surgical 16 Renal surgical
113 Gynacology surgical 17 Other surgical
114 Orthopedics surgical 17 Other surgical
205 Other surgical 17 Other surgical
Unknown 18 Unknown
Table 7: Mapping between APACHE II and IV group codes to one unique APACHE group diagnostic group encoding, which is used for the patient phenotyping task.

First, the ICU mortality label was extracted from the general data table of the HiRID database. The label of the time-point 24 hours after ICU admission was set to 1 (positive) if the patient died at the end of the stay according to this field, and 0 (negative) otherwise, defining a binary classification problem to be solved once per stay. If the admission was shorter than 24 hours, no label was assigned to the patient.

Remaining length of stay.

The continuous regression label of a time-point in the ICU stay was defined as the number of hours that remain until the end-of-the-stay, i.e. the first label of the ICU stay is equal to the ICU stay length (in hours) and the last label just before dispatch is defined as 0. To determine the beginning and end of the ICU stay, the first and last observed heart rate measurements (vm1) were used.

Kidney function.

As a basis for extracting the kidney function label, the urine output rate (vm24) variable and the weight variable (vm131) were used. The valid labels were anchored to real measurements, i.e. updates, of the urine output rate. Only time-points exactly 2h before an update of the urine output rate were assigned a prediction label. The overall urine output in the 2h after this time point was found by computing the integral of the urine output rate variable (vm24) over this 2h period. The overall urine output in the time interval was then normalized to the weight of the patient (unit kg) at the time of the prediction, and then standardized to a rate/hour. Hence, the unit of the regression output is ml/kg/h.

Circulatory failure.

As a basis for computing the circulatory failure status of a patient at every time-point of the ICU stay, the following variables were used: Mean arterial pressure (vm5), arterial lactate (vm136) and the vasopressive agents Milrinone (pm42), Dobutamine (pm41), Levosimendan (pm43), Theophylline (pm44), Norepinephrine (pm39), Epinephrine (pm40) and Vasopressin (pm45). For defining the label at time a centralized window of size 2h was anchored at , and the lactate/MAP conditions in this window were separately analyzed. The time-point was assigned a (tentative) positive label if, for at least 2/3 of the time-points inside the 2h, the lactate condition was satisfied, and for at least 2/3 of the time-points inside the 2h, the MAP condition was satisfied. The time-points at which they have to be satisfied do not necessarily have to coincide. The conditions defining the circulatory failure state are listed below:

  • Lactate condition. Elevated arterial lactate ( mmol/l)

  • MAP/vasopressor condition. Low MAP ( mmHg ) or any dose of any of the considered vasopressors (pm39, pm40, pm41, pm42, pm43, pm44, pm45) given to the patient, which could potentially mask a low MAP.

Using these endpoint annotations of “circulatory failure” and “no circulatory failure”, a label at time point is either (1) invalid (no prediction made), if the patient was in circulatory failure at , (2) positive if the patient was not circulatory failure at , but there was a new onset of circulatory failure in the next 12h, and (3) negative if the patient was not in circulatory failure at , and there was no onset of circulatory failure in the next 12h.

Respiratory failure.

Estimating the respiratory failure status of a patient at a given time-point involves estimating (1) their instantaneous FiO value, (2) their instantaneous PaO value and computing the P/F ratio as P/F = PaO/FiO, and then labeling the time series according to the threshold P/F = 300 mmHg, to define (mild) failure and stability periods.

We distinguished between the following cases to estimate the FiO value at a time-point .

  • FiO from ventilator: If there was a FiO measurement (vm58) in the last 30 min and the patient was on the ventilator or the ventilation mode NIV (vm60) was active, then the last FiO measurement (vm58) was directly used as the estimate.

  • FiO from supplementary oxygen (oxygen mask): If there was a real measurement in supplementary oxygen (vm23) in the last 12h, then it was forward filled and used to estimate FiO via the conversion Table 8.

  • FiO from ambient air: An ambient air FiO assumption was made, and FiO was estimated as 21 %.

Supp. oxygen [l] FiO [%]
1 26
2 34
3 39
4 45
5 49
6 54
7 57
8 58
Supp. oxygen [l] FiO [%]
9 63
10 66
11 67
12 69
13 70
14 73
15 75
15 75
Table 8: Supplemental Oxygen to FiO conversion table used for determining the continuous FiO estimate.

We distinguished between the following cases tor estimating the PaO value at a time-point. As source variables peripheral oxygen saturation (vm20) and PaO from ABGA (vm140

) were used. Hereby the SpO2 variable was pre-smoothed with a percentile (75 % percentile) moving window filter of size 30 min to remove spuriously low outlier measurements.

  • Real PaO measurement: If there was a real PaO (vm140) from an arterial blood gas analysis available in the last 30 minutes, then the estimate was defined directly by the measurement.

  • Ellis estimate from SpO: Otherwise the last measurement of peripheral oxygen saturation measured by pulse oximetry (vm20) was used to compute an estimate of PaO according to Ellis et al. [10]. If there is no real SpO measurement in the last 24h, a default value of 98 % was assumed for the estimation.

First, the resulting PaO estimate was smoothed with a Nadaraya-Watson kernel smoother with a bandwidth of 20 min. Then, each so-derived PaO estimate was weighted by a squared exponential kernel and converted to the closest plausible PaO measurement. The scale of the kernel function was chosen such that a1h distance with the measurement time resulted in a weight of .

The time series was labeled for respiratory failure by using forward-facing windows anchored at time-point . A patient is considered as being in respiratory failure at if, for at least 2/3 time-points of the next 2h, either of the following conditions hold:

  • The estimated P/F ratio is mmHg and the patient is not on mechanical ventilation OR

  • The estimated P/F ratio is mmHg and the patient is on mechanical ventilation, but the PEEP value is not densely measured (no real measurement in 30 min) OR

  • The estimated P/F ratio is mmHg and the patient is on mechanical ventilation, and the PEEP is densely measured and the PEEP is mmHg.

After the definition of event periods, their edges were corrected according to estimates of the P/F ratio at the given time points. If before the event P/F mmHg, the event is further extended in the past. Likewise, if the P/F mmHg condition holds after the event, it is extended into the future.

Finally, small events shorter than 4h, preceded and succeeded by two stability periods, of which one is longer than 4h are deleted. This is because these likely represent intermittent/spurious instances of respiratory failure. Conversely, short stability periods shorter than 4h, preceded and followed by periods of respiratory failure, of which one is longer than 4h, are defined as respiratory failure. Likely, during these periods, the patient does not recover from the failure in a clinical sense.

Statistics on annotated failure events and labels

Below we display statistics on the data resulting from the endpoint stage, and from the label stage, which was used directly as inputs for the machine learning models.

Respiratory failure

Respiratory failure
Patients with events 83.08 % 28,157 admissions
Per-time prevalence of resp. failure 61.83 %
Median # events per patient (if 1 event) [IQR] 1 [1-2]
Median event duration [IQR] 9.5 [4.1-20.6] hours
Median time to first event (if 1 event) [IQR] 1.58 [0-11] hours
Median gap length between two events [IQR] 4.4 [2.1-11] hours
Overall label prevalence (Stable->Failure 12h) 8.6 %
Median label prevalence p/patient (if 1 event) [IQR] 0 [0-60.1] %
Table 9:

Summary statistics about annotated respiratory failure (oxygenation failure with a P/F ratio <300 mmHg) events in the pre-processed data-set. The label prevalence refers to the observed probability of developing respiratory failure in the next 12h, given that the patient is currently stable.

Circulatory failure

Circulatory failure
Patients with events 25.58 % 8,669 admissions
Per-time prevalence of circ. failure 6.70 %
Median # events per patient (if 1 event) [IQR] 1 [1-2]
Median event duration [IQR] 2.9 [1.1-7.6] hours
Median time to first event (if 1 event) [IQR] 1.9 [0.5-7.3] hours
Median gap length between two events [IQR] 3.3 [1.3-9.6] hours
Overall label prevalence (Stable->Failure 12h) 1.4 %
Median label prevalence p/patient (if 1 event) [IQR] 3.8 [0-19.1] %
Table 10: Summary statistics about annotated circulatory failure (elevated lactate and abnormally low MAP or vasopressive agents) events in the pre-processed data-set. The label prevalence refers to the observed probability of developing circulatory failure in the next 12h, given that the patient is currently stable.

Mortality prediction / Patient phenotyping

Percentage of ICU stays with length >24h 44.1 % 14,962 admissions
Mortality prediction
Overall label prevalence (Mortality at 24 hours) 8.38 %
Patient phenotyping
Prevalence ’Neurologic’ (4) 23.6 % 3,511 admissions
Prevalence ’(Cardio)vascular’ (1) 15.9 % 2,367 admissions
Prevalence ’(Cardio)vascular surgical (11) 12.4 % 1,848 admissions
Prevalence ’Respiratory’ (2) 11.0 % 1,635 admissions
Prevalence ’Neurologic surgical’ (14) 7.4 % 1,106 admissions
Prevalence ’Trauma not surgical’ (7) 6.7 % 990 admissions
Prevalence ’Gastrointestinal’ (3) 6.3 % 930 admissions
Prevalence ’Other medical disease’ (6) 5.2 % 778 admissions
Prevalence ’Gastrointestinal surgical’ (13) 4.8 % 713 admissions
Prevalence ’Metabolic/Endocrinology’ (8) 2.3 % 344 admissions
Prevalence ’Respiratory surgical’ (12) 1.7 % 246 admissions
Prevalence ’Other surgical’ (17) 1.2 % 181 admissions
Prevalence ’Trauma surgical’ (15) 1.0 % 148 admissions
Prevalence ’Hematologic’ (9) 0.4 % 57 admissions
Prevalence ’Renal surgical’ (16) 0.2 % 27 admissions
Table 11: Summary statistics about the mortality prediction and patient phenotyping tasks defined at 24h after ICU admission. The last column shows the number of admissions assigned this label.

Kidney function / Remaining-length-of-stay regression

Figure 4: Marginal distribution of the urine regression labels. The vertical red line denotes the mean, and the vertical green line the median of the distribution.
Figure 5: Marginal distribution of the remaining length-of-stay regression labels. The vertical red line denotes the mean, and the vertical green line the median of the distribution.

In this section we provide a description of the full pipeline of the HiRID-ICU Benchmark and go through the main steps in detail. We also address the ML Reproducibility checklist questions111111ML Reproducibility checklist.

  1. Data loading. Request access and then download the data from (see how to request access to the data in the README section of the Software Repository121212 The data is available under the PhysioNet Credentialed Health Data License 1.5.0.

  2. Data preprocessing. The preprocessing step is described in detail in the Appendix A: Dataset Details

    . It is composed of several stages: merge stage, resample stage, feature extraction stage.

  3. Task implementation. In this stage we construct the state annotations and machine learning labels for our predefined set of tasks.

  4. Model training and evaluation. We select and train both ML and DL models on the training and validation sets. We evaluate model performance on the test set.

The script in the icu_benchmarks folder unites three stages of the pipeline which can be used individually: Preprocessing (Section B.1), Task implementation (Section B.2), and Evaluation (Section B.3). Each stage has an associated help function. Below we describe each stage in detail.


The method preprocess unites all necessary steps to generate the input for the Deep Learning (DL) and Machine Learning (ML) models:

  1. run_merge_step —  Converts ICU data from table format to a matrix format that can be input to the machine learning model (see details in Appendix A: Dataset Details).

  2. run_resample_step — Re-samples data to a regularly sampled 5min resolution.

  3. run_feature_extraction_step — Hand-engineered feature extraction for classical machine learning models.

  4. run_build_dl — Data splitting and preprocessing specific to ML.

Running these steps requires the following arguments:

  1. hirid-data-root — Path to the unpacked parquet files containing the HiRID data, downloaded from Physionet.

  2. work-dir — Path to the working directory of the user.

  3. var-ref-path — Path to the file containing meta-data about variables necessary for our pre-processing pipeline.

  4. split-path — Path to the exact split of the data for model training and evaluation.

  5. nr-workers — Number of workers to use during training. It depends on the user hardware capacity.

To run the whole preprocessing, the command below is used:

2icu-benchmarks preprocess --hirid-data-root [path to unpacked parquet files] \
3                          --work-dir [output directory] \
4                          --var-ref-path ./preprocessing/resources/varref.tsv \
5                          --split-path ./preprocessing/resources/split.tsv \
6                          --nr-workers 8

Task Implementation

The second stage of the pipeline, consisting of extracting task labels, is constructed around the following three steps:

  1. imputation_for_endpoints — Imputes data for endpoint generation.

  2. generate_endpoints — Generates the endpoints i.e. verify if conditions for events are matched at each timestep.

  3. generate_labels — Extract final labels from endpoints.

As mentioned in the paper, we construct a set of 6 tasks, relevant for healthcare workers:

  1. Mortality_At24Hours — Mortality prediction at 24h after admission in the ICU (further Mortality).

  2. Phenotyping_APACHEGroup — Patient APACHE group classification 24h after admission in the ICU (further Patient Phenotyping).

  3. Dynamic_CircFailure_12Hours — Continuous prediction of occurrence of circulatory failure in the next 12 hours (further Circulatory Failure).

  4. Dynamic_RespFailure_12Hour — Continuous prediction of occurrence of respiratory failure in the next 12 hours (further Respiratory Failure).

  5. Remaining_LOS_Reg — Continuous prediction of the remaining stay duration (further Remaining LOS).

  6. Dynamic_UrineOutput_2Hours_Reg — Continuous prediction of patient urine production in the next 2h (further Kidney Function).

Model Training and Evaluation

In the final part of the pipeline, we provide the user with a choice of model to train:

  1. from the list of DL models: Transformer, LSTM, GRU and Temporal CNN. All these models use the class DLWrapper defining methods mechanics for deep learning models’ training and evaluation.

  2. from the list of the ML Models: Gradient Boosting method and Logistic Regression. In the same manner as DL models, these models have a MLWrapper class.

All models are trained on the training set. The validation set is used for early stopping and model selection through a random search. After the training procedure, the model performance is evaluated on the test set.

To run the training, the following command is used (as an example, we provide here the command for GRU model training for the Dynamic_CircFailure_12Hours task):

1icu-benchmarks train -c configs/hirid/Classification/GRU.gin \
2                     -l [path to logdir] \
3                     -t Dynamic_CircFailure_12Hours\
4                     -sd 1111
  • Argument -l specifies the path to the directory where the trained model and meta-data will be stored.

  • Argument -t specifies the selected task.

  • Argument -sd specifies the random seed.

  • Argument -c specifies path to gin-config configuration file. This file should include the path to the data after preprocessing, the task, the model with all hyper-parameters.

An example of the configuration file for the GRU model is provided below131313See examples of configuration files for all the tasks and models in the config folder of the software repository.

1import gin.torch.external_configurables
2import icu_benchmarks.models.wrappers
3import icu_benchmarks.models.encoders
4import icu_benchmarks.models.utils
8EMB = 231
9LR = 3e-4
10HIDDEN = 64
12DEPTH = 1
13BS = 64
14EPOCHS = 1000
15TASK = ’Dynamic_CircFailure_12Hours’
16RES = 1
17RES_LAB = 1
18MAXLEN = 2016
21# Train params
22train_common.model = @DLWrapper()
23train_common.dataset_fn = @ICUVariableLengthDataset
24train_common.data_path = [path to data] # TODO add by user
25train_common.weight = train_common.do_test = True
27DLWrapper.encoder = @GRU()
28DLWrapper.loss = @cross_entropy
29DLWrapper.optimizer_fn = @Adam
30DLWrapper.train.epochs = DLWrapper.train.batch_size = DLWrapper.train.patience = 10
31DLWrapper.train.min_delta = 1e-4
34ICUVariableLengthLoaderTables.splits = [’train’,’test’,’val’]
35ICUVariableLengthLoaderTables.task = ICUVariableLengthLoaderTables.data_resampling = ICUVariableLengthLoaderTables.label_resampling = ICUVariableLengthDataset.maxlen =
37# Optimizer params = Adam.weight_decay = 1e-6
40# Encoder params
41GRU.input_dim = GRU.hidden_dim = GRU.layer_dim = GRU.num_classes = \end{lstlisting}
43We define all macros at the top of the file and use them to configure some classes and functions. Among them, we have two custom function for data loading:
45    \item \texttt{ICUVariableLengthLoaderTables}~--- Main Loader class allowing to sample patient from the data. \\
46    \item \texttt{ICUVariableLengthDataset}~--- \texttt{pytorch} dataset wrapper to ship data to the GPU.
51\section*{Technical Specifics for Reproducibility}\label{appsec:detail-training}
52\paragraph{Libraries} A full list of libraries and the version we used is provided in the \texttt{environment.yml} file. The most important ones are the following: pytorch 1.8.1, scikit-learn 0.24.1, ignite 0.4.4, CUDA 10.2.89, cudNN 7.6.5.
55We follow all guidelines provided by \texttt{pytorch} documentation to ensure reproducibility of our results. However, reproducibility across devices is not ensured. Thus we provide here the characteristics of our infrastructure. We trained deep learning methods on a single \texttt{NVIDIA RTX2080Ti} with a \texttt{Xeon E5-2630v4} core. For other methods we trained models on either \texttt{Xeon E5-2697v4} cores or \texttt{Xeon Gold 6140} cores.
59For the main experiment, 10 different random initializations were used for each model; For the ablation study, 5 were used. For all models, we tuned specific hyper-parameters using random search with 100 iterations for stay-level tasks and 50 iterations for the dynamic ones. Each random set of parameters was run with 3 different random initializations. Early stopping with 10 step patience on the loss was used as a stopping criterion. We then chose hyper-parameters on AUPRC, balanced accuracy, and MAE for respectively, binary, multi-class, and regression tasks.
61\paragraph{Complexity of training}
62Among the deep learning methods, transformer memory complexity, with regard to the sequence length, is quadratic. This has forced us to reduce both batch size and the number of parameters of this model for the dynamic tasks. Also, to ensure reproducibility, using deterministic algorithms particularly slows down TCN training. With our hardware, training deep learning methods takes less than 1h for stay-level tasks and less than 6h for dynamic ones on a single GPU.
63To reduce RAM consumption we provide the possibility to load data only at inference time with parameter \texttt{on\_RAM} in our loader. In that case, training is slightly slower but requires only around 8GB of RAM.
65On the other hand, ML methods are faster to train but more RAM-consuming. Training any model takes less than 4h on 4 CPUs. However, peak memory can exceed 100GB when using hand-engineered features.
67\section*{Hyperparameters Search}\label{appsec:hyper-parameters}
68In this section we detail the range of hyperparameters we searched over and the one we used for our experiments.
69\subsection*{Common Hyperparameters}
70For the training of DL methods, we used certain parameters across multiple tasks and architecture as reported in Table \ref{tab:fixed-hp-dl}.
71For the ML methods, we fixed certain parameters as in the original HiRID paper. For LGBM, we set the bagging frequency to 1,
72the number of leaves to $2^\textbf{depth}$,  and the minimum number of children per leaf to 1000. In the rest of the section, we report the hyperparameters we searched over in our experiments.
75    \centering
76    \begin{tabular}{l|c|c|c|c}
77        \toprule
78        Models & Optimizer  & Weight Decay & Batch Size Online & Batch Size Stay Level  \\
79        \midrule
80        LSTM & Adam  & 1e-6 & 64  & 64 \\
81        GRU & Adam  & 1e-6 & 64  & 64 \\
82        TCN & Adam  & 1e-6 & 64   & 64\\
83        Transformer & Adam  & 1e-6 & 8   & 16 \\
84        \bottomrule
85        \end{tabular}
86    \caption{Fixed Hyperparameters for DL Methods }
87    \label{tab:fixed-hp-dl}
91\subsection*{Deep Learning model Hyperparameters}
92In this section, we detail the range of hyperparameters considered for LSTM, GRU, TCN and transformer models.
94The range of hyperparameters considered for the LSTM Model can be found in Table \ref{tab:hp-search-lstm}.
98    \centering
99    \footnotesize
100    \setlength\tabcolsep{2pt}
106Task &      Learning Rate &  Drop-out & Depth & Hidden Dimension & Loss Weighting \\
110Mortality & (1e-5, 3e-5, \textbf{1e-4}, 3e-4) &  (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &     (\textbf{1}, 2, 3) &    (32, 64, \textbf{128}, 256) & (\textbf{None}, Balanced) \\
111Phenotyping & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &     (\textbf{1}, 2, 3) &    (32, 64, 128, \textbf{256}) & (None, \textbf{Balanced}) \\
115Circ. Failure & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, \textbf{0.2}, 0.3, 0.4) &     (1, 2, \textbf{3}) &    (32, 64, 128, \textbf{256}) & (\textbf{None}, Balanced) \\
116Resp. Failure & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &     (1, 2, \textbf{3}) &    (32, 64, \textbf{128}, 256) & (\textbf{None}, Balanced) \\
121Urine Output & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, 0.2, \textbf{0.3}, 0.4) &     (1, 2, \textbf{3}) &    (32, 64, \textbf{128}, 256) & N.A \\
122Rem. LOS & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, \textbf{0.2}, 0.3, 0.4) &     (1, 2, \textbf{3}) &    (32, 64, 128, \textbf{256}) & N.A \\
125    \caption{Hyperparameter search range for LSTM. In \textbf{bold} are the parameters we selected using random search.}
126    \label{tab:hp-search-lstm}
131\paragraph{GRU}The range of hyperparameters considered for the GRU Model can be found in Table \ref{tab:hp-search-gru}.
133    \centering
134    \footnotesize
135    \setlength\tabcolsep{2pt}
140Task &      Learning Rate &  Drop-out & Depth & Hidden Dimension & Loss Weighting \\
144Mortality & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &     (1, \textbf{2}, 3) &    (32, \textbf{64}, 128, 256) & (\textbf{None}, Balanced) \\
145Phenotyping & (\textbf{1e-5}, 3e-5, 1e-4, 3e-4) &  (0.0, 0.1, 0.2, 0.3, \textbf{0.4}) &     (\textbf{1}, 2, 3) &    (32, 64, 128, \textbf{256}) & (None, \textbf{Balanced}) \\
149Circ. Failure & (1e-5, 3e-5, \textbf{1e-4}, 3e-4) &  (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &     (1, 2, \textbf{3}) &    (32, 64, 128, \textbf{256}) & (\textbf{None}, Balanced) \\
150Resp.Failure & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &     (1, 2, \textbf{3}) &    (32, \textbf{64}, 128, 256) & (None, Balanced) \\
155Urine Output & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, 0.2, \textbf{0.3}, 0.4) &     (1, 2, \textbf{3}) &    (32, 64, 128, \textbf{256}) & N.A \\
156Rem. LOS & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, 0.2, \textbf{0.3}, 0.4) &     (1, \textbf{2}, 3) &    (32, 64, \textbf{128}, 256) & N.A \\
159    \caption{Hyperparameter search range for GRU. In \textbf{bold} are the parameters we selected using random search. }
160    \label{tab:hp-search-gru}
164\paragraph{TCN}The range of hyperparameters considered for the TCN Model can be found in Table \ref{tab:hp-search-tcn}. Note that we do not consider any depth factor as it is fully determined by the kernel size and the sequence length.
166    \centering
167    \footnotesize
168    \setlength\tabcolsep{2pt}
173Task & Learning Rate &  Drop-out & Kernel & Hidden Dimension & Loss Weighting \\
177Mortality & (1e-5, 3e-5, \textbf{1e-4}, 3e-4) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &    (2, \textbf{4}, 8, 16, 32) &    (32, 64, 128, \textbf{256}) & (\textbf{None}, Balanced) \\
178Phenotyping  & (1e-5, 3e-5, \textbf{1e-4}, 3e-4) &  (0.0, 0.1, \textbf{0.2}, 0.3, 0.4) &  (2, 4, 8, 16, \textbf{32}) &    (32, 64, \textbf{128}, 256) &
179(None,\textbf{Balanced}) \\
183Circ. Failure & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &   (2, \textbf{4}, 8, 16, 32) &    (\textbf{32}, 64, 128, 256) & (\textbf{None}, Balanced) \\
184Resp. Failure & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &    (\textbf{2}, 4, 8, 16, 32) &    (32, 64, 128, \textbf{256}) & (\textbf{None}, Balanced) \\
189Urine Output & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, \textbf{0.2}, 0.3, 0.4) &   (2, 4, \textbf{8}, 16, 32) &    (32, 64, 128, \textbf{256}) & N.A \\
190Rem. LOS & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, 0.2, \textbf{0.3}, 0.4) &     (2, 4, 8, 16, \textbf{32}) &    (32, 64, \textbf{128}, 256) & N.A \\
193    \caption{Hyperparameter search range for TCN. In \textbf{bold} are the parameters we selected using random search. }
194    \label{tab:hp-search-tcn}
198\paragraph{Transformer} The range of hyperparameters considered for Transformer Model can be found in Table \ref{tab:hp-search-transformer} and Table \ref{tab:hp-search-transformer_2}. We considered smaller parameters for online tasks due to GPU memory limitations.
200    \centering
201    \footnotesize
205Task & Learning Rate &  Drop-out & Nb. Heads & Depth  \\
209Mortality & (\textbf{1e-5}, 3e-5, 1e-4, 3e-4) &  (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &    (1, 2, \textbf{4}, 8) &   (1, \textbf{2}, 3)  \\
210Phenotyping & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, 0.1, \textbf{0.2}, 0.3, 0.4) &   (1, 2, 4, \textbf{8}) &  (1, \textbf{2}, 3) \\
214Circ. Failure & (1e-5, 3e-5, \textbf{1e-4}, 3e-4) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) & (1, \textbf{2}, 4) & \textbf{1}  \\
215Resp. Failure & (1e-5, 3e-5, \textbf{1e-4}, 3e-4) &  (0.0, 0.1, 0.2, \textbf{0.3}, 0.4) &  (\textbf{1}, 2, 4) & \textbf{1}  \\
220Urine Output & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &   (\textbf{1}, 2, 4, 8) & \textbf{1} \\
221Rem. LOS & (1e-5, 3e-5, 1e-4, \textbf{3e-4}) &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &    (\textbf{1}, 2, 4, 8)&\textbf{1} \\
224    \caption{Hyperparameter search range for the Transformer. In \textbf{bold} are the parameters we selected using random search. }
225    \label{tab:hp-search-transformer}
230    \centering
231    \footnotesize
235Task & Attention Drop-out & Hidden Dimension & Loss Weighting \\
238Mortality & (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &  (32, 64, 128, \textbf{256}) & (\textbf{None}, Balanced) \\
239Phenotyping & (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) &  (\textbf{32}, 64, 128, 256) & (None, \textbf{Balanced}) \\
242Circ. Failure & (0.0, 0.1, 0.2, \textbf{0.3}, 0.4) &  (32, \textbf{64}, 128) & (\textbf{None}, Balanced) \\
243Resp. Failure & (0.0, 0.1, \textbf{0.2}, 0.3, 0.4) &  (32, 64, \textbf{128}) & (\textbf{None}, Balanced) \\
246Urine Output & (0.0, \textbf{0.1}, 0.2, 0.3, 0.4) &  (32, \textbf{64}, 128) & N.A \\
247Rem. LOS  &  (\textbf{0.0}, 0.1, 0.2, 0.3, 0.4) & (32, 64, \textbf{128})  & N.A \\
250    \caption{Hyperparameter search range for Transformer. In \textbf{bold} are the parameters we selected using random search. }
251    \label{tab:hp-search-transformer_2}
256\subsection*{Machine Learning Models Hyperparameters}
257\paragraph{Gradient Boosting} The range of hyperparameters considered for the gradient boosting method, LightGBM framework\footnote{\url{}} can be found in Table \ref{tab:hp-search-gb} and \ref{tab:hp-search-gb-feat} :
259    \centering
263Task & Depth &  Colsample\_bytree\tablefootnote{Subsample ratio of columns when constructing each tree.} & Subsample\tablefootnote{Subsample ratio of the training instance} \\
266Mortality & (3, 4, 5, 6, \textbf{7})  & (0.33, 0.66, \textbf{1.00}) & (\textbf{0.33}, 0.66, 1.00) \\
267Phenotyping & (\textbf{3}, 4, 5, 6, 7) & (0.33, \textbf{0.66}, 1.00)& (0.33, \textbf{0.66}, 1.00) \\
270Circulatory Failure &(3, \textbf{4}, 5, 6, 7) & (0.33, 0.66, \textbf{1.00})& (0.33, \textbf{0.66},1.00) \\
271Respiratory Failure & (3,4, \textbf{5}, 6, 7) & (0.33, 0.66, \textbf{1.00})& (0.33, \textbf{0.66},1.00) \\
274Urine Output & (3,\textbf{4}, 5, 6, 7) & (0.33, 0.66, \textbf{1.00})& (0.33, 0.66, \textbf{1.00}) \\
275Remaining Length-of-Stay & (3, 4, 5, 6, \textbf{7}) & (\textbf{0.33}, 0.66, 1.00)& (0.33, 0.66, \textbf{1.00}) \\
278    \caption{Hyperparameter search range for LGBM. In \textbf{bold} are the parameters we selected using random search.}
279    \label{tab:hp-search-gb}
284    \centering
288Task & Depth &  Colsample\_bytree\tablefootnote{Subsample ratio of columns when constructing each tree.} & Subsample\tablefootnote{Subsample ratio of the training instance} \\
291Mortality & (3, 4, 5, 6, \textbf{7})  &(0.33, 0.66, \textbf{1.00}) & (0.33, 0.66, \textbf{1.00}) \\
292Phenotyping & (3, 4, \textbf{5}, 6, 7) & (\textbf{0.33}, 0.66, 1.00)& (\textbf{0.33}, 0.66, 1.00) \\
295Circulatory Failure &(\textbf{3}, 4, 5, 6, 7) & (0.33, \textbf{0.66}, 1.00)& (\textbf{0.33}, 0.66, 1.00) \\
296Respiratory Failure &(3, 4, 5, \textbf{6}, 7) &(0.33, 0.66, \textbf{1.00}) & (0.33, 0.66, \textbf{1.00}) \\
299Urine Output&(3, 4, 5, \textbf{6}, 7) &(0.33, 0.66, \textbf{1.00}) & (0.33, 0.66, \textbf{1.00}) \\
300Remaining Length-of-Stay & (3, 4, 5, 6, \textbf{7}) & (0.33, \textbf{0.66}, 1.00)& (\textbf{0.33}, 0.66, 1.00) \\
303    \caption{Hyper-parameters range for LGBM w. features. In \textbf{bold} are the parameters we selected using random search.}
304    \label{tab:hp-search-gb-feat}
307\paragraph{Logistic Regression}
308The search range of hyperparameters considered for Logistic Regression\footnote{scikit-learn framework} can be found in Table \ref{tab:hp-search-lr}:
310    \centering
314Task & C &  Penalty \\
317Mortality & (0.001, 0.01, 0.1, \textbf{1}, 10) & (’l1’, \textbf{’l2’})   \\
318Phenotyping &(0.001, 0.01, \textbf{0.1}, 1, 10) & (’l1’,  \textbf{’l2’}) \\
321Circulatory Failure &(0.001, 0.01, 0.1, 1, \textbf{10}) & (’l1’,  \textbf{’l2’}) \\
322Respiratory Failure &(0.001, 0.01, \textbf{0.1}, 1, 10) &(’l1’, \textbf{’l2’})   \\
325    \caption{Hyperparameter search range for Logistic Regression. In \textbf{bold} are the parameters we selected using random search.}
326    \label{tab:hp-search-lr}