Severely ill patients require treatment and surveillance in Intensive Care Units (ICU). Critical health conditions are characterized by the presence or risk of developing life-threatening organ dysfunction. During a patient’s stay in the ICU continuous monitoring of organs function parameters, enables early recognition of physiological deterioration and rapid commencement of appropriate interventions. Recent research shows the great success of machine learning methods when applied to ICU time series ncle ; horn2020set . One of the main goals of previous work was to develop new methods for prediction tasks relevant for clinical decision-making. Exemplary of such tasks are alarm systems that predict different types of organ failure hyland2020early ; tomavsev2019clinically .
To develop and evaluate such methods only a small number of large-scale datasets are freely-accessible: The MIMIC-III johnson2016mimic and IV MIMIC-IV datasets, AmsterdamUMCdb ams , HiRID (hirid, ) and the eICU Collaborative Research Database pollard2018eicu . However, these datasets are not provided in a pre-processed form directly suitable for machine learning nor do they have well-defined tasks, making it impossible to fairly compare works johnson2017reproducibility . While some pre-processed alternatives with well-defined tasks exist (goldberger2000physiobank, ; reyna2019early, ), they are often lacking in terms of size and diversity of tasks. We provide more detail about this in section 2. This leads to situations where works compare methods on their private data (tomavsev2019clinically, ) or only on limited data and number of tasks. Also the lack of relevant clinical sub-tasks for benchmarking stops the development of new methods for clinical decision support systems harutyunyan2019multitask . Finally, as in other fields, with recent datasets such as HiRID (hirid, ) the time resolution of data has greatly increased. However, no benchmark on ICU time series using such high-resolution datasets currently exists.
To improve this situation, in this paper we provide an in-depth benchmark based on the HiRID dataset (hirid, ; hyland2020early, )111https://physionet.org/content/hirid/1.1.1/, which was released on Physionet hirid alongside the publication on the circulatory Early Warning Score (circEWS) hyland2020early . HiRID is a freely accessible critical care dataset containing data recorded at the Department of Intensive Care Medicine, Bern University Hospital, Switzerland (ICU). The dataset was developed in cooperation with the Swiss Federal Institute of Technology (ETH), Zürich, Switzerland. We define a new benchmark on HiRID composed of various clinically relevant tasks and provide a comprehensive pipeline, which includes all steps from preprocessing to model evaluation. To assess the different aspects of the benchmarked machine learning methods, we diversify the tasks around specific challenges of ICU data such as prediction frequency, class imbalance, or organ dependency of the task. To profit from data acquisition advances and allow improvement on longer time series, we use a resampled data resolution of 5 min. HiRID has a higher time resolution than any other published critical care dataset and it motivates us to provide a comprehensive benchmark suite on this data-set. Also, we believe that this dataset will motivate the construction of new predictive methods for the healthcare field, going beyond ICU time series.
The main contributions of this paper are:
We developed a comprehensive, end-to-end pipeline for time-series analysis of critical care data based on the recently published HiRID dataset. This pipeline includes the following stages: data preprocessing mode, training mode, and evaluation mode.
We proposed and implemented a variety of tasks relevant to healthcare workers in the ICU, diversified in terms of type, prediction resolution, and label prevalence. The tasks cover all major organ systems as well as the general patient state. We included regression and classification (binary and multi-class) tasks.
By providing a comprehensive benchmark on a set of canonical tasks, we give the research community around predictive modeling on ICU time series the possibility for the clear comparison of their methods.
The paper is organized as follows: in Section 2 we provide an overview of existing ICU datasets and benchmarking papers. We provide details about the HiRID dataset and introduce the tasks defined in collaboration with clinicians in Section 3 and give more details on the tasks in Appendix A: Dataset Details. Section 4 describes the pipeline design, with more details given in Appendix B: HiRID-ICU-Pipeline Details. Section 5 describes the experiment and ablation study. In Section 6 we discuss the observed results and relate this paper to other benchmarks and related tasks relevant for clinicians.
2 Related Work
The main goal of this work is to provide a benchmark on the HiRID dataset for various clinical prediction tasks of interest. We describe here other ICU datasets as well as existing benchmarks for ICU data.
ICU time-series datasets
There are several widely-used, freely-accessible datasets consisting of ICU time series. MIMIC-IV johnson2016mimic is the oldest and most widely used ICU dataset available. It consists of physiological measurements as well as information about laboratory analyses. Physiological measurements are recorded with a maximum resolution of 1 hour. The results of laboratory analysis are collected at irregular time intervals. Moreover, there are static features like gender, age, diagnosis, etc. available. The dataset consists of information recorded about 40,000 ICU stays at Beth Israel Deaconess Medical Center (BIDMC), Boston, MA, USA. The median of the patient stay length is 2 days. The eICU Collaborative Research Database pollard2018eicu is a large multicenter critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology. It contains data associated with over 200,000 patient stays, but the public version does not reach the granularity of other datasets in terms of time resolution and data elements. The first version of AmsterdamUMCdb ams was released in November 2019. Its current version from March 2020 contains data related to 23,172 ICU and high dependency unit admissions of adult patients from 2003 - 2016 from Amsterdam University Medical Centers. The data includes clinical observations like vital signs, clinical scores, device data, and lab results.
Benchmarks on ICU time-series.
Among works using the openly available datasets mentioned above, to the best of our knowledge, only a single standardized benchmark exists, MIMIC-III benchmark by Harutyunyan et al. harutyunyan2019multitask . In that work four tasks were proposed, two requiring one prediction per patient stay and two dynamic tasks with one prediction per hour each. In addition, while they do not propose a benchmark, Jarrett et al. jarrettclairvoyance developed a standardized pipeline for medical time series, called Clairvoyance. Results were shown on several datasets, including MIMIC. In this spirit, some packages address a specific family of tasks, for example, classification faouzi2020pyts and forecasting guecioueurpysf . Finally, some public challenges, with curated data, were proposed in the past, e.g. for the early prediction of sepsis (Physionet 2019 challenge reyna2019early ) or mortality prediction (Physionet 2012 challenge citi2012physionet ). However, the provided datasets are smaller than HiRID and built around a single task.
3 Benchmark Design
3.1 The HiRID Dataset
HiRID (hirid, ; hyland2020early, ) is a freely accessible critical care dataset containing data from more than 33,000 patient admissions to the Department of Intensive Care Medicine, Bern University Hospital, Switzerland (ICU) from January 2008 to June 2016. It was released on Physionet hirid alongside the publication of the circulatory Early Warning Score (circEWS) hyland2020early . The dataset was developed in cooperation with the Swiss Federal Institute of Technology (ETH) Zürich, Switzerland. It contains de-identified demographic information and a total of 712 routinely collected physiological variables, diagnostic test results, and treatment parameters. HiRID has a higher time resolution than any other published ICU dataset, particularly for bedside monitoring, with most parameters recorded every 2 minutes, which motivates us to provide a comprehensive benchmark suite on this dataset. Demographic information about the patient cohort are displayed in Appendix Table 1.
|Task name||Task type||Task description|
|ICU mortality||Binary classification, one prediction per stay||Predicted at 24h after admission to the ICU.|
|[0.3pt/1pt] Patient phenotyping||Multi-class classification, one prediction per stay||Classifying the patient after 24h regarding the admission diagnosis, using the APACHE group II and IV labels222APACHE II and IV Zimmerman2006-of ; Knaus1985-iw are subsequent versions of the major illness severity score used in the ICU. They also introduce a patient grouping according to admission reason. We use an aggregate of these two groupings for this task (see Appendix A: Dataset Details)|
|Circulatory failure 333Circulatory failure is defined as Lactate mmol/l and either mean arterial blood pressure mmHg or administration of any vasoactive drug.||Binary classification, dynamic prediction throughout stay||Continuous prediction of onset of circulatory failure in the next 12h, given the patient is not in failure now.|
|[0.3pt/1pt] Respiratory failure444Respiratory failure is defined according to the Berlin definition (ARDS_Definition_Task_Force2012-hl, ) as a P/F ratio mmHg.||Binary classification, dynamic prediction throughout stay||Continuous prediction of onset of respiratory failure in the next 12h, given the patient is not in failure now.|
|Kidney function||Regression, dynamic prediction throughout stay||Continuous prediction of urine production in the next 2h as an average rate in ml/kg/h. The task is predicted at irregular intervals.|
|[0.3pt/1pt] Remaining length of stay||Regression, dynamic prediction throughout stay||Continuous prediction of the remaining ICU stay duration.|
3.2 Prediction Tasks
Our benchmark suite focuses on clinically relevant prediction tasks with a large diversity in the machine learning task types. The tasks cover most major organ systems as well as the general patient state. The major organ systems include the cardiovascular, kidney, and respiratory systems. For each organ system, we provide a prediction task related to the main organ function. Length of stay, mortality, and patient phenotyping are chosen to represent tasks regarding an overall patient state. From a machine learning point of view, our suite contains regression and classification (binary and multi-class) tasks. We included tasks with different degrees of class imbalance to diversify the spectrum further and enable the comparison of methods on e.g. highly imbalanced tasks. We chose tasks performed online throughout the stay (every 5 minutes) and at fixed times of the stay, such as 24h after ICU admission, which capture a more long-term state of the patient. To enhance reproducibility, we include some tasks previously considered in harutyunyan2019multitask , mortality, and remaining length-of-stay prediction. Table 1 contains the full task descriptions.
4 Pipeline Design
Figure 1 shows an overview of the major HiRID-ICU pipeline steps. The pipeline is designed using the preprocess-train-predict paradigm. We provide more details about it in Appendix B: HiRID-ICU Pipeline Details and the README section of the software repository555https://github.com/ratschlab/HIRID-ICU-Benchmark. The preprocessed data contains two data versions, common_stage and ml_stage
. The first is independent of modeling choices and serves as the starting point for future works with custom pre-processing choices. The latter is a compatible version for our pipeline with our categorical encoding, imputation, and scaling choices.
4.1 Data Pre-processing
In its public version, HiRID, as any real-world dataset, contains certain artifacts that require pre-processing. As pointed out by (bellamy2020evaluating, ) for MIMIC-III, individual pre-processing in each work avoids a fair comparison of them. To this effect, we aim to provide a modular and reproducible pipeline. Patient EHRs in HiRID are stored in a long table format where each row of the table is a record containing the measurement value of a specific variable at a specific time for a patient, which cannot be used as a ready input for machine learning models.
To obtain a more compact format, the first pre-processing step in our pipeline is to transform the long table of patient EHRs into feature matrices, where each column represents a clinical concept, which we call wide format. Such a data format represents an irregularly sampled multivariate time series. At this step, we also remove any physiologically impossible measurement.
After this merging step, we further compact the dataset by re-sampling it to a 5 minute resolution. Thus, each time step contains the last value measured in the last 5 min if it exists, or is empty otherwise. This gridding is similar to the strategy used by (hyland2020early, ). We refer to the output of this step as the common_stage in Fig.1. Because it is independent of modeling choices, this stage provides a starting point for future approaches using different imputation and scaling choices.
Processing for Machine Learning
In the second part of the pipeline, we process the common stage of the data to be compatible with ML models’ expected input format. For this, we first use forward-filling imputation for each stay. Then, we apply one-hot encoding for categorical variables and scale the remaining ordinal or continuous variables. We standard scaled all variables with the exception of the time since admission and admission date, which we min-max scaled. By doing scaling globally, we ensure to preserve patients’ specificity (e.g.: tachycardia). We refer to the output of this stage as theml_stage as it is dependent on our modeling choices.
4.2 Hand-engineered Feature Extraction
In the original paper describing the HiRID dataset (hyland2020early, ), the authors showed that boosted tree ensembles such as LGBM (ke2017lightgbm, ), when provided with hand-engineered features, outperform state-of-the-art deep learning methods. Based on this observation, we include in our pipeline the possibility to extract such features from the common_stage of the data. For our models, we extracted four features for each non-categorical variable over the entire history: minimum past value, maximum past value, mean past value and density of measurement, which is the proportion of time points where a value is provided among all possible time points in the history666This is done on the regularly sampled version of the data. These features are then included in the ml_stage.
4.3 Label Construction and Splitting
We construct prediction task labels using the provided measurements and meta-data for both continuous and stay-level tasks. As an intermediate step for label construction, we use a forward imputed version of the data, as in the modeling stage. Concerning the experimental design, we use a random split by patients. The training set contains 70% of the patients and validation and test sets, 15% each. A temporal splitting strategy as used by Hyland et al. (hyland2020early, ) would be more clinically relevant but information about admission time was removed to preserve anonymity when the dataset was originally published. While longer stays exist in the dataset, for computational reasons, we limited labeling to the first 7 days of stays (2016 steps of 5 min). This cropping affects less than 6% of all stays.
|Task name||Train set||Validation set||Test set|
|Circ. failure||1.4 % (n=14.12M)||1.3 % (n=3.01M)||1.4 % (n=2.96M)|
|Resp. failure||8.7 % (n=5.58M)||8.4 % (n=1.21M)||8.4 % (n=1.20M)|
|Mortality||8.7 % (n=10525)||7.1 % (n=2206)||8.3 % (n=2231)|
|Phenotyping||0.2 % (n=10470)||0.1 % (n=2194)||0.1 % (n=2217)|
|Kidney function||1.17 ml/kg/h (n=341424)||1.12 ml/kg/h (n=71549)||1.18 ml/kg/h (n=70642)|
|Rem. LOS||41.04h (n=15.15M)||41.51h (n=3.22M)||39.64h (n=3.17M)|
4.4 Model Training and Evaluation
The final part of the pipeline contains an end-to-end machine learning suite to train and evaluate our models depicted on the right hand side of Fig.1. ML approaches were implemented using scikit-learnscikit-learn and lightgbmke2017lightgbm , whereas DL approaches were implemented in pytorch paszke2019pytorch . All DL models were trained using Adam optimizer kingma2014adam , with a cross-entropy objective for classification tasks and mean-squared error (MSE) for regression tasks. For classification we provide the possibility to balance loss weights according to class prevalence as in king2001logistic .
For the evaluation of models, we use a range of metrics relevant to each task. For classification tasks, we considered AUROC777https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html and AUPRC888https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html metrics in the binary case, and balanced accuracy (B-Accuracy) brodersen2010balanced in the multi-class one. For regression tasks, we used mean absolute error (MAE) as a comparison metric. Regardless of the task or model, we used the scikit-learn implementation for all metrics. More details about this stage of the pipeline can be found in Appendix B: HiRID-ICU-Pipeline Details.
For all models, we tuned specific hyper-parameters using random search. Each randomly picked set of parameters was run with 3 different random initializations. We then selected hyper-parameters on the validation set performance for either AUPRC, B-Accuracy, or MAE. All models were trained with early stopping on the validation loss. Further details about hyper-parameters can be found in Appendix B: HiRID-ICU-Pipeline Details.
Because of the class imbalance existing in classification tasks, we considered balanced loss weights for all methods. However as further discussed in subsection 5.5, this technique was relevant only for the Patient Phenotyping task. For regression tasks, we min-max scaled the labels at training time to avoid exploding gradients.
5.2 Benchmarked Methods
In our proposed benchmark, we considered two groups of machine learning algorithms. The first group consists of regular machine learning algorithms, which as shown are highly effective for ICU-related tasks (sadeghi2018early, ; harutyunyan2019multitask, ; hyland2020early, )
. It is composed of a Gradient Boosting method with LightGBM(ke2017lightgbm, )hochreiter1997long, ) and GRU (cho2014learning, )
), convolutional neural networks (CNN), in particular, temporal convolutional networks (TCN)bai2018empirical and Transformer models vaswani2017attention .
5.3 Benchmarking Models on High-resolution ICU Data
In this section, we compare the earlier described methods on all tasks. While DL approaches are provided with the entire history for all time points, ML methods use only the values of the current step as an input. Thus one would expect the latter models to perform significantly worse due to the lower amount of information provided.
|Task||ICU Mortality||Patient Phenotyping|
|Metric||AUPRC ()||AUROC ()||B-Accuracy ()|
|LR||58.1 0.0||89.0 0.0||39.1 0.0|
|LGBM||54.6 0.8||88.8 0.2||40.4 0.8|
|LGBM w. Feat.||62.6 0.0||90.5 0.0||45.8 2.0|
|GRU||60.3 1.6||90.0 0.4||39.2 2.1|
|LSTM||60.0 0.9||90.3 0.2||39.5 1.2|
|TCN||60.2 1.1||89.7 0.4||41.6 2.3|
|Transformer||61.0 0.8||90.8 0.2||42.7 1.4|
. In bold are the methods within one standard deviation of the best one. Classification metrics were scaled to 100 for readability purposes.
|Task||Circulatory failure||Respiratory failure||Kidney func.||Remaining LOS|
|Metric||AUPRC ()||AUROC ()||AUPRC ()||AUROC ()||MAE ()||MAE ()|
|LR||35.6 0.0||96.9 0.0||49.3 0.0||91.3 0.0||N.A||N.A|
|LGBM||43.5 0.3||97.7 0.0||54.2 0.3||92.8 0.0||0.45 0.00||56.9 0.4|
|LGBM w. Feat.||43.9 0.6||97.7 0.0||55.7 0.1||93.1 0.0||0.45 0.00||57.0 0.3|
|GRU||41.9 0.3||97.5 0.0||52.8 0.5||92.6 0.1||0.49 0.02||60.6 0.9|
|LSTM||36.4 1.3||96.6 0.2||51.6 0.8||92.3 0.1||0.50 0.01||60.7 1.6|
|TCN||41.5 0.5||97.5 0.0||53.4 0.4||92.7 0.1||0.50 0.01||59.8 2.8|
|Transformer||42.1 0.6||97.6 0.0||53.7 0.4||92.7 0.1||0.48 0.02||59.5 2.8|
When comparing methods on tasks requiring a single prediction after 24h (Table 3), we observe the superiority of LGBM with hand-extracted features. Transformers outperformed other DL methods but we observe a significant performance gap with the best ML method in B-Accuracy for Patient Phenotyping and AUPRC for ICU Mortality. Concerning GRU and LSTM, their performance is similar to TCN’s for ICU Mortality. However, on the Patient Phenotyping task, they do not manage to outperform even logistic regression.
Online Failure Predictions
For the continuous classification tasks, where the maximum sequence length extends from 288 steps to 2016, DL methods do not leverage the additional history information. Indeed, as shown in Table 4, for both Circulatory and Respiratory Failure, LGBM trained only on the current variables outperforms all DL methods. Among these methods, LSTM is the most impacted, as it has noticeably lower scores. Finally, for all continuous tasks, including regression discussed below, the improvement brought by hand-extracted features is not as significant. It suggests that statistical features, when extracted from the entire history, are less informative.
Online Regression Tasks
The final set of tasks we benchmark are regression tasks (Table 4). As for the classification case, LGBM-based methods outperform DL methods, which, among them, have similar performance. In addition, we do not observe any improvement brought by our selection of hand-extracted features. Moreover, the overall performances of the proposed methods are relatively low. While a MAE of ml/kg/h for Kidney Function is only two times smaller than the median urine output rate, a h error in Remaining LOS is more than two times the median length-of-stay. We believe these low scores are due to the nature of the labels’ distributions, which are both heavy-tailed as shown in Appendix A: Dataset Details.
5.4 Behaviour of Deep Learning Approaches for Long Time-Series
One notable difference between the MIMIC-III benchmark harutyunyan2019multitask and our work is the data resolution. The resolution of our data being twelve-time higher leads to 2016 steps (1 week) sequences for online tasks. Thus, we explore if the increase of sequence length explains DL methods’ decrease in performances for continuous tasks.
One way to verify if DL methods leverage long-term dependencies in their prediction is to check if a decrease in the considered history impacts performance. Among the DL methods, we can verify this for Transformers and TCN by using local attention in the first case and reducing the number of dilated convolutions in the other. In the results (Figure 2), we observe that these two models do not use the additional information provided by early steps. With only a 1h history, Transformer performance stays the same. TCN performance, on the other hand, does drop, but only for a history smaller than 6h. It could explain why LGBM, which does not have access to these extra time steps, manages to outperform both methods compared to stay-level tasks.
Another approach to decrease the length of sequences is to reduce the data resolution. We compare all DL methods with a 1h prediction interval to assess data resolution impact on metrics. This way, we can gradually decrease the data resolution from 5min to 1h while preserving the same prediction time-steps. We report the result of this experiment in Figure 3. We observe that while TCN and Transformer performance are almost identical, GRU and LSTM are both impacted in opposite ways. GRU is noticeably better than LSTM on both tasks with a 5min grid, but as resolution lowers to 1h, methods’ performances become similar.
5.5 On Weighting Cross-entropy by Class Prevalence
All the tasks we define show a high degree of imbalance and the class imbalance problem (CIP) is known to be highly challenging johnson2019survey . The most common approach to this problem is the use of class weights in the loss objective. For this ablation, we adopt the original idea from (king2001logistic, ) by defining weights inversely proportional to class prevalence. However, as shown in Table 5, such a weighting choice decreases performance in all binary classification tasks.
|Task||ICU Mortality||Phenotyping||Circulatory Failure||Respiratory Failure|
|Metric||AUPRC ()||AUROC ()||B-Accuracy ()||AUPRC ()||AUROC ()||AUPRC ()||AUROC ()|
|LGBM w. Feat.||-1.3||0.0||+4.3||-0.8||-0.2||-4.4||-1.1|
In this paper, we provided an in-depth benchmark on the HiRID dataset and evaluated the behaviour of machine learning models on various clinically relevant tasks developed in collaboration with intensive care clinicians. Our primary contribution is a full and reproducible preprocessing and machine learning pipeline and benchmark tasks on a public intensive care dataset, a necessary prerequisite for reproducible and comparable research in the future. We further evaluate current state-of-the-art machine and deep learning algorithms on these tasks establishing a baseline to compare future methods against. We consider this our second major contribution.
This work confirms previous results (hyland2020early, )
, that conventional machine learning models (i.e. boosted ensembles of decision trees) outperform current deep learning approaches on medical time series problems. Based on the experimental results we found that deep learning models do not lead to the same breakthrough performance increases as in other domains (such as NLPdevlin2018bert dosovitskiy2020image ). We believe the sparsity of the data and the imbalance of labels in both regression and classification tasks play an important role in this. For classification tasks, building a specific objective for highly imbalanced tasks such as Focal loss lin2017focal might be a potential area of research. For regression, recent work has shown some promising leads for heavy-tailed regression tasks yang2021delving . Moreover, HiRID introduces a novel high-resolution aspect in ICU data, that needs to be correctly taken into account. Thus, as for other sequence data, one possible explanation could be that when trained with extremely long sequences, models can not use the extracted features in the most effective way zaheer2020big . In the case of Transformers, to force the model to learn and extract useful patterns, various kinds of improvements could be made tay2020efficient . In particular, learnable patterns could be incorporated roy2021efficient .
Our work goes beyond previous ICU time-series benchmarks (e.g. (harutyunyan2019multitask, )) by using a more diverse set of tasks and a data set with a higher time resolution. As discussed earlier the set of clinical prediction tasks is diverse regarding the assessed organ systems, prevalence, and task type. An important limitation of our study is that HiRID is currently not the most frequently used and hence known ICU data set.
This work facilitates the future development of machine learning methods and standardized comparison of their performance on a diverse set of predefined tasks. It could contribute to solving today’s problem of machine learning on medical time series not being comparable due to each work’s unique datasets, preprocessing, and tasks definition. We hypothesize that methods developed and successfully evaluated on these tasks can also be successfully transferred to other specific medical time series problems.
This work also fills the gap between proposed machine learning approaches and their applications to ICU tasks. As a concrete example, COVID-19 is a big challenge for ICU patient monitoring. Important issues in this context are the uncertainty of the patient’s prognosis as well as the prediction of the disease progression. COVID-19 is known to cause respiratory failure li2020neuroinvasive , one of the tasks studied in our benchmark, which is also the main cause for ICU admission and death li2020acute ; ruan2020clinical ; richardson2020presenting ; holter2020systemic . During the current COVID-19 pandemic, e.g. first attempts to construct a Respiratory Failure prediction model were already done bolourani2021machine , however, the data is available only for a limited audience, limiting reproducibility.
In this paper, we proposed an in-depth benchmark on time series collected from an Intensive Care Unit (ICU). In collaboration with clinicians, we defined several tasks relevant for healthcare covering different critical aspects of ICU patient monitoring. We provide a reproducible end-to-end pipeline to derive both data and labels, and a training setup to evaluate the final performance. We hope that this benchmark facilitates the construction and evaluation of machine learning methods for ICU data, and encourages reproducible research in this field.
-  ARDS Definition Task Force, V Marco Ranieri, Gordon D Rubenfeld, B Taylor Thompson, Niall D Ferguson, Ellen Caldwell, Eddy Fan, Luigi Camporota, and Arthur S Slutsky. Acute respiratory distress syndrome: the Berlin Definition. JAMA, 307(23):2526–2533, June 2012.
-  Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
-  David Bellamy, Leo Celi, and Andrew L Beam. Evaluating progress on machine learning for longitudinal electronic healthcare data. arXiv preprint arXiv:2010.01149, 2020.
-  Siavash Bolourani, Max Brenner, Ping Wang, Thomas McGinn, Jamie S Hirsch, Douglas Barnaby, Theodoros P Zanos, Northwell COVID-19 Research Consortium, et al. A machine learning prediction model of respiratory failure within 48 hours of patient admission for COVID-19: model development and validation. Journal of medical Internet research, 23(2):e24246, 2021.
Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M
The balanced accuracy and its posterior distribution.
2010 20th international conference on pattern recognition, pages 3121–3124. IEEE, 2010.
-  Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
-  Luca Citi and Riccardo Barbieri. Physionet 2012 challenge: Predicting mortality of icu patients using a cascaded SVM-GLM paradigm. In 2012 Computing in Cardiology, pages 257–260. IEEE, 2012.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
-  RK Ellis. Determination of po2 from saturation. Journal of applied physiology, 67(2):902–902, 1989.
-  M. Faltys, M. Zimmermann, X. Lyu, M. Hüser, S. Hyland, G. Rätsch, and T. Merz. HiRID, a high time-resolution icu dataset (version 1.1.1). PhysioNet, 2021.
-  Johann Faouzi and Hicham Janati. pyts: A python package for time series classification. Journal of Machine Learning Research, 21(46):1–6, 2020.
-  Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roger G Mark, Joseph E Mietus, George B Moody, Chung-Kang Peng, and H Eugene Stanley. PhysioBank, PhysioToolkit, and PhysioNet: components of a new research resource for complex physiologic signals. circulation, 101(23):e215–e220, 2000.
-  Ahmed Guecioueur. pysf: Supervised forecasting of sequential data in python, 2018. https://github.com/alan-turing-institute/pysf.
-  Hrayr Harutyunyan, Hrant Khachatrian, David C Kale, Greg Ver Steeg, and Aram Galstyan. Multitask learning and benchmarking with clinical time series data. Scientific data, 6(1):1–18, 2019.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Jan C Holter, Soeren E Pischke, Eline de Boer, Andreas Lind, Synne Jenum, Aleksander R Holten, Kristian Tonby, Andreas Barratt-Due, Marina Sokolova, Camilla Schjalm, et al. Systemic complement activation is associated with respiratory failure in COVID-19 hospitalized patients. Proceedings of the National Academy of Sciences, 117(40):25018–25025, 2020.
-  Max Horn, Michael Moor, Christian Bock, Bastian Rieck, and Karsten Borgwardt. Set functions for time series. In International Conference on Machine Learning, pages 4353–4363. PMLR, 2020.
-  Stephanie L Hyland, Martin Faltys, Matthias Hüser, Xinrui Lyu, Thomas Gumbsch, Cristóbal Esteban, Christian Bock, Max Horn, Michael Moor, Bastian Rieck, et al. Early prediction of circulatory failure in the intensive care unit using machine learning. Nature medicine, 26(3):364–373, 2020.
-  Daniel Jarrett, Jinsung Yoon, Ioana Bica, Zhaozhi Qian, Ari Ercole, and Mihaela van der Schaar. Clairvoyance: A pipeline toolkit for medical time series. In International Conference on Learning Representations, 2020.
-  Alistair Johnson, Lucas Bulgarelli, Tom Pollard, Steven Horng, Leo Anthony Celi, and Roger Mark. MIMIC-IV, 2020.
-  Alistair EW Johnson, Tom J Pollard, and Roger G Mark. Reproducibility in critical care: a mortality prediction case study. In Machine Learning for Healthcare Conference, pages 361–376. PMLR, 2017.
-  Alistair EW Johnson, Tom J Pollard, Lu Shen, H Lehman Li-Wei, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
-  Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019.
-  Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30:3146–3154, 2017.
-  Gary King and Langche Zeng. Logistic regression in rare events data. Political analysis, 9(2):137–163, 2001.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  W A Knaus, E A Draper, D P Wagner, and J E Zimmerman. APACHE II: a severity of disease classification system. Crit. Care Med., 13(10):818–829, October 1985.
-  Xu Li and Xiaochun Ma. Acute respiratory failure in COVID-19: is it “typical” ards? Critical Care, 24:1–5, 2020.
-  Yan-Chao Li, Wan-Zhu Bai, and Tsutomu Hashikawa. The neuroinvasive potential of SARS-CoV2 may play a role in the respiratory failure of COVID-19 patients. Journal of medical virology, 92(6):552–555, 2020.
-  Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
-  Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
-  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
-  Tom J Pollard, Alistair EW Johnson, Jesse D Raffa, Leo A Celi, Roger G Mark, and Omar Badawi. The eICU collaborative research database, a freely available multi-center database for critical care research. Scientific data, 5(1):1–13, 2018.
-  Matthew A Reyna, Chris Josef, Salman Seyedi, Russell Jeter, Supreeth P Shashikumar, M Brandon Westover, Ashish Sharma, Shamim Nemati, and Gari D Clifford. Early prediction of sepsis from clinical data: the physionet/computing in cardiology challenge 2019. In 2019 Computing in Cardiology (CinC), pages Page–1. IEEE, 2019.
-  Safiya Richardson, Jamie S Hirsch, Mangala Narasimhan, James M Crawford, Thomas McGinn, Karina W Davidson, Douglas P Barnaby, Lance B Becker, John D Chelico, Stuart L Cohen, et al. Presenting characteristics, comorbidities, and outcomes among 5700 patients hospitalized with COVID-19 in the New York City area. Jama, 323(20):2052–2059, 2020.
-  Aurko Roy, Mohammad Saffar, Ashish Vaswani, and David Grangier. Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 9:53–68, 2021.
-  Qiurong Ruan, Kun Yang, Wenxia Wang, Lingyu Jiang, and Jianxin Song. Clinical predictors of mortality due to COVID-19 based on an analysis of data of 150 patients from wuhan, china. Intensive care medicine, 46(5):846–848, 2020.
-  Reza Sadeghi, Tanvi Banerjee, and William Romine. Early hospital mortality prediction using vital signals. Smart Health, 9:265–274, 2018.
-  Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald Metzler. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732, 2020.
Patrick J Thoral, Jan M Peppink, Ronald H Driessen, Eric JG Sijbrands, Erwin JO
Kompanje, Lewis Kaplan, Heatherlee Bailey, Jozef Kesecioglu, Maurizio
Cecconi, Matthew Churpek, et al.
Sharing icu patient data responsibly under the society of critical care medicine/european society of intensive care medicine joint data science collaboration: The amsterdam university medical centers database (amsterdamumcdb) example.Critical care medicine, 49(6):e563, 2021.
-  Nenad Tomašev, Xavier Glorot, Jack W Rae, Michal Zielinski, Harry Askham, Andre Saraiva, Anne Mottram, Clemens Meyer, Suman Ravuri, Ivan Protsyuk, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature, 572(7767):116–119, 2019.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
-  Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. arXiv preprint arXiv:2102.09554, 2021.
-  Hugo Yèche, Gideon Dredsner, Francesco Locatello, Matthias Hüser, and Gunnar Rätsch. Neighborhood contrastive learning applied to online patient monitoring. In International Conference on Machine Learning. PMLR, 2021.
-  Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, et al. Big Bird: Transformers for longer sequences. arXiv preprint arXiv:2007.14062, 2020.
-  Jack E Zimmerman, Andrew A Kramer, Douglas S McNair, and Fern M Malila. Acute physiology and chronic health evaluation (APACHE) IV: hospital mortality assessment for today’s critically ill patients. Crit. Care Med., 34(5):1297–1310, May 2006.
The HiRID ICU data is provided by the University Hospital Bern, Bern, Switzerland. The dataset consists of 33,905 patients whose length of stays in the ICU range from 0 to 28 days. A more detailed summary of the HiRID cohort statistics can be found in Table 6.
For all patients, a variety of clinical measurements are collected. It represents a set of 710 variables that can be categorized into the following types:
Demographics (e.g.: Sex, Age, Height)
Bedside vital signs (e.g.: Heart rate)
Settings of medical devices (e.g.: Mechanical ventilation)
Manual observations (e.g.: Urine output)
Lab measurements (e.g.: Lactate)
Treatments (e.g.: Vasopressor agents)
One notable difference between these types of measurements is the resolution at which they are provided. Bedside vital signs are provided at regular intervals of 2 min, whereas lab measurements are only available every couple of hours at best. The frequency of recording discrepancy existing between different variables yields unique challenges for machine learning models. Further details about the available clinical measurements can be found in the official documentation of HiRID999URL for the official HiRID documentation https://hirid.intensivecare.ai/ as well as in our software repository101010 Table containing all information for each variable: https://github.com/ratschlab/HIRID-ICU-Benchmark/blob/master/preprocessing/resources/varref.tsv
|APACHE group (Patient phenotype)||Cardiovascular||Surgical||8,125||patients|
|Other medical diseases||1,428||patients|
|Length of stay||Median||0.95||days|
The HiRID data when directly exported from the data management system contains various types of artifacts. The most common one concerns measurement values that are out of the normal range defined by clinicians. Another artifact that we observed is that the time of the first measurement record of cumulative variables is at noon by default, which could be much earlier than the admission time of the patient. This, if not taken into account correctly, will affect the rate calculation for those cumulative variables. To solve these timestamp issues, we clip the time of the first record of the cumulative variables to the ICU admission time of the patient.
There are often various variables in the ICU EHRs that have the same or similar clinical meaning. As an example, there exist three variables for temperature, namely core body temperature, auxiliary temperature, and rectal temperature. Usually, only one or few variables from one medical concept are measured for a patient, therefore, merging variables with the same clinical concept into one meta-variable reduces the sparsity of the feature matrices. Another advantage is that machine learning models trained on medical concept features are more transferable to other hospitals than those trained on the original clinical variables because hospitals usually have their own variable coding scheme. We hence incorporated variable-merging in our first pre-processing step. A challenge that arises from merging pharmaceutical variables is the dosage normalization across drugs with similar active ingredients. Therefore, instead of using the drug dosage as features, which likely contain invalid information, we use “presence/absence” binary indicators of drugs as features for the machine learning models.
Definition of pharmaceutical acting periods
When a patient has been administrated a drug at a time point, only this time point contains a measurement. However, each drug is active for a specific duration after this administration time. For this reason, we ensure to propagate the measurement for a duration corresponding to the drug acting period. We set the value back to zero after this period ensuring compatibility with forward filling imputation.
Conversion of cumulative values to rates
There are a few fluid output variables whose measurement values are recorded as cumulative every 12 hours starting from noon on the ICU admission day. Since the cumulative signals are periodic, we converted them to hourly fluid rates which are more interpretable and amenable as machine learning features and for the endpoint definition.
Endpoint definition and statistics
Endpoint extraction algorithms
Our benchmark suite contains a range of clinically relevant tasks chosen to range over a spectrum of different properties of tasks, e.g. task type from the machine learning point-of-view (regression, binary classification, multi-class classification), point of the ICU stay at which the prediction is made (fixed time-point vs. predictions made throughout the stay), as well as the degree of class imbalance (highly imbalanced tasks vs. more balanced tasks). While in the main paper we describe the main idea of each task, the full definition and implementation details of the extraction algorithm are described below.
Dependencies on previous pipeline stages.
Static information about the patient is extracted from the static.parquet file, in which the mortality status and the APACHE-II/IV codes for the admission are stored. This file is derived from the general_table.parquet and observations tables, where APACHE II/IV codes are stored, during the data preprocessing pipeline stage. Respiratory/kidney/circulatory failure labels are generated from the merged stage of the data, via an intermediate imputation step, in which measurements are forward filled to statistics about the number of real measurements in time grid intervals were pre-computed. These simplified the implementation of the endpoints, whose extraction algorithms use a combination of imputed values and real measurement detection strategies.
Label masking on presence of vital sign monitoring.
All tasks described below were additionally masked with a condition on a current connection to vital sign monitors. This implied setting a valid label (according to the below task definitions) to invalid (NaN) if the patient had no heart rate measurement (vm1) in the 15 minutes surrounding the time grid point. Since the expected frequency of heart rate measurements is 1 sample / 2 minutes, it is likely the patient is disconnected when this condition is satisfied, and no reliable predictions on their state can be made.
To derive the APACHE diagnostic group of the patient, the two fields APACHE-II group and APACHE-IV group of the static HiRID data are used. To map these two fields to a standardized encoding, the conversion Table 7 is used. If the patient had a valid APACHE-II group entry and its code was mapped in the table, the target APACHE group code was used. Otherwise, we tried to fall back to the APACHE-IV group entry, where again if it was present and mapped, the corresponding target APACHE group code was used. If still no code could be extracted, the prediction task was defined as invalid for the patient. The label of the time-point 24h after ICU admission was set to the class category of the APACHE group, defining a multi-class classification problem to be solved once in the stay. If the admission was shorter than 24h, no label for the patient was assigned.
|APACHE II Code||APACHE IV Code||Original group name||Target group||Target group name|
|197||Urogenital||6||Other medical diseases|
|102||Sepsis||6||Other medical diseases|
|106||198||Other||6||Other medical diseases|
|206||Intoxication||6||Other medical diseases|
|103||194||Trauma not surgical||7||Trauma not surgical|
|107||199||(Cardio)vascular surgical||11||(Cardio)vascular surgical|
|108||201||Respiratory surgical||12||Respiratory surgical|
|109||200||Gastrointestinal surgical||13||Gastrointestinal surgical|
|110||202||Neurologic surgical||14||Neurologic surgical|
|111||203||Trauma surgical||15||Trauma surgical|
|112||204||Renal surgical||16||Renal surgical|
|113||Gynacology surgical||17||Other surgical|
|114||Orthopedics surgical||17||Other surgical|
|205||Other surgical||17||Other surgical|
First, the ICU mortality label was extracted from the general data table of the HiRID database. The label of the time-point 24 hours after ICU admission was set to 1 (positive) if the patient died at the end of the stay according to this field, and 0 (negative) otherwise, defining a binary classification problem to be solved once per stay. If the admission was shorter than 24 hours, no label was assigned to the patient.
Remaining length of stay.
The continuous regression label of a time-point in the ICU stay was defined as the number of hours that remain until the end-of-the-stay, i.e. the first label of the ICU stay is equal to the ICU stay length (in hours) and the last label just before dispatch is defined as 0. To determine the beginning and end of the ICU stay, the first and last observed heart rate measurements (vm1) were used.
As a basis for extracting the kidney function label, the urine output rate (vm24) variable and the weight variable (vm131) were used. The valid labels were anchored to real measurements, i.e. updates, of the urine output rate. Only time-points exactly 2h before an update of the urine output rate were assigned a prediction label. The overall urine output in the 2h after this time point was found by computing the integral of the urine output rate variable (vm24) over this 2h period. The overall urine output in the time interval was then normalized to the weight of the patient (unit kg) at the time of the prediction, and then standardized to a rate/hour. Hence, the unit of the regression output is ml/kg/h.
As a basis for computing the circulatory failure status of a patient at every time-point of the ICU stay, the following variables were used: Mean arterial pressure (vm5), arterial lactate (vm136) and the vasopressive agents Milrinone (pm42), Dobutamine (pm41), Levosimendan (pm43), Theophylline (pm44), Norepinephrine (pm39), Epinephrine (pm40) and Vasopressin (pm45). For defining the label at time a centralized window of size 2h was anchored at , and the lactate/MAP conditions in this window were separately analyzed. The time-point was assigned a (tentative) positive label if, for at least 2/3 of the time-points inside the 2h, the lactate condition was satisfied, and for at least 2/3 of the time-points inside the 2h, the MAP condition was satisfied. The time-points at which they have to be satisfied do not necessarily have to coincide. The conditions defining the circulatory failure state are listed below:
Lactate condition. Elevated arterial lactate ( mmol/l)
MAP/vasopressor condition. Low MAP ( mmHg ) or any dose of any of the considered vasopressors (pm39, pm40, pm41, pm42, pm43, pm44, pm45) given to the patient, which could potentially mask a low MAP.
Using these endpoint annotations of “circulatory failure” and “no circulatory failure”, a label at time point is either (1) invalid (no prediction made), if the patient was in circulatory failure at , (2) positive if the patient was not circulatory failure at , but there was a new onset of circulatory failure in the next 12h, and (3) negative if the patient was not in circulatory failure at , and there was no onset of circulatory failure in the next 12h.
Estimating the respiratory failure status of a patient at a given time-point involves estimating (1) their instantaneous FiO value, (2) their instantaneous PaO value and computing the P/F ratio as P/F = PaO/FiO, and then labeling the time series according to the threshold P/F = 300 mmHg, to define (mild) failure and stability periods.
We distinguished between the following cases to estimate the FiO value at a time-point .
FiO from ventilator: If there was a FiO measurement (vm58) in the last 30 min and the patient was on the ventilator or the ventilation mode NIV (vm60) was active, then the last FiO measurement (vm58) was directly used as the estimate.
FiO from supplementary oxygen (oxygen mask): If there was a real measurement in supplementary oxygen (vm23) in the last 12h, then it was forward filled and used to estimate FiO via the conversion Table 8.
FiO from ambient air: An ambient air FiO assumption was made, and FiO was estimated as 21 %.
|Supp. oxygen [l]||FiO [%]|
|Supp. oxygen [l]||FiO [%]|
We distinguished between the following cases tor estimating the PaO value at a time-point. As source variables peripheral oxygen saturation (vm20) and PaO from ABGA (vm140
) were used. Hereby the SpO2 variable was pre-smoothed with a percentile (75 % percentile) moving window filter of size 30 min to remove spuriously low outlier measurements.
Real PaO measurement: If there was a real PaO (vm140) from an arterial blood gas analysis available in the last 30 minutes, then the estimate was defined directly by the measurement.
Ellis estimate from SpO: Otherwise the last measurement of peripheral oxygen saturation measured by pulse oximetry (vm20) was used to compute an estimate of PaO according to Ellis et al. . If there is no real SpO measurement in the last 24h, a default value of 98 % was assumed for the estimation.
First, the resulting PaO estimate was smoothed with a Nadaraya-Watson kernel smoother with a bandwidth of 20 min. Then, each so-derived PaO estimate was weighted by a squared exponential kernel and converted to the closest plausible PaO measurement. The scale of the kernel function was chosen such that a1h distance with the measurement time resulted in a weight of .
The time series was labeled for respiratory failure by using forward-facing windows anchored at time-point . A patient is considered as being in respiratory failure at if, for at least 2/3 time-points of the next 2h, either of the following conditions hold:
The estimated P/F ratio is mmHg and the patient is not on mechanical ventilation OR
The estimated P/F ratio is mmHg and the patient is on mechanical ventilation, but the PEEP value is not densely measured (no real measurement in 30 min) OR
The estimated P/F ratio is mmHg and the patient is on mechanical ventilation, and the PEEP is densely measured and the PEEP is mmHg.
After the definition of event periods, their edges were corrected according to estimates of the P/F ratio at the given time points. If before the event P/F mmHg, the event is further extended in the past. Likewise, if the P/F mmHg condition holds after the event, it is extended into the future.
Finally, small events shorter than 4h, preceded and succeeded by two stability periods, of which one is longer than 4h are deleted. This is because these likely represent intermittent/spurious instances of respiratory failure. Conversely, short stability periods shorter than 4h, preceded and followed by periods of respiratory failure, of which one is longer than 4h, are defined as respiratory failure. Likely, during these periods, the patient does not recover from the failure in a clinical sense.
Statistics on annotated failure events and labels
Below we display statistics on the data resulting from the endpoint stage, and from the label stage, which was used directly as inputs for the machine learning models.
|Patients with events||83.08 %||28,157 admissions|
|Per-time prevalence of resp. failure||61.83 %|
|Median # events per patient (if 1 event) [IQR]||1 [1-2]|
|Median event duration [IQR]||9.5 [4.1-20.6] hours|
|Median time to first event (if 1 event) [IQR]||1.58 [0-11] hours|
|Median gap length between two events [IQR]||4.4 [2.1-11] hours|
|Overall label prevalence (Stable->Failure 12h)||8.6 %|
|Median label prevalence p/patient (if 1 event) [IQR]||0 [0-60.1] %|
Summary statistics about annotated respiratory failure (oxygenation failure with a P/F ratio <300 mmHg) events in the pre-processed data-set. The label prevalence refers to the observed probability of developing respiratory failure in the next 12h, given that the patient is currently stable.
|Patients with events||25.58 %||8,669 admissions|
|Per-time prevalence of circ. failure||6.70 %|
|Median # events per patient (if 1 event) [IQR]||1 [1-2]|
|Median event duration [IQR]||2.9 [1.1-7.6] hours|
|Median time to first event (if 1 event) [IQR]||1.9 [0.5-7.3] hours|
|Median gap length between two events [IQR]||3.3 [1.3-9.6] hours|
|Overall label prevalence (Stable->Failure 12h)||1.4 %|
|Median label prevalence p/patient (if 1 event) [IQR]||3.8 [0-19.1] %|
Mortality prediction / Patient phenotyping
|Percentage of ICU stays with length >24h||44.1 %||14,962 admissions|
|Overall label prevalence (Mortality at 24 hours)||8.38 %|
|Prevalence ’Neurologic’ (4)||23.6 %||3,511 admissions|
|Prevalence ’(Cardio)vascular’ (1)||15.9 %||2,367 admissions|
|Prevalence ’(Cardio)vascular surgical (11)||12.4 %||1,848 admissions|
|Prevalence ’Respiratory’ (2)||11.0 %||1,635 admissions|
|Prevalence ’Neurologic surgical’ (14)||7.4 %||1,106 admissions|
|Prevalence ’Trauma not surgical’ (7)||6.7 %||990 admissions|
|Prevalence ’Gastrointestinal’ (3)||6.3 %||930 admissions|
|Prevalence ’Other medical disease’ (6)||5.2 %||778 admissions|
|Prevalence ’Gastrointestinal surgical’ (13)||4.8 %||713 admissions|
|Prevalence ’Metabolic/Endocrinology’ (8)||2.3 %||344 admissions|
|Prevalence ’Respiratory surgical’ (12)||1.7 %||246 admissions|
|Prevalence ’Other surgical’ (17)||1.2 %||181 admissions|
|Prevalence ’Trauma surgical’ (15)||1.0 %||148 admissions|
|Prevalence ’Hematologic’ (9)||0.4 %||57 admissions|
|Prevalence ’Renal surgical’ (16)||0.2 %||27 admissions|
Kidney function / Remaining-length-of-stay regression
In this section we provide a description of the full pipeline of the HiRID-ICU Benchmark and go through the main steps in detail. We also address the ML Reproducibility checklist questions111111ML Reproducibility checklist.
Data loading. Request access and then download the data from https://physionet.org/content/hirid/1.1.1/ (see how to request access to the data in the README section of the Software Repository121212https://github.com/ratschlab/HIRID-ICU-Benchmark/.) The data is available under the PhysioNet Credentialed Health Data License 1.5.0.
Data preprocessing. The preprocessing step is described in detail in the Appendix A: Dataset Details
. It is composed of several stages: merge stage, resample stage, feature extraction stage.
Task implementation. In this stage we construct the state annotations and machine learning labels for our predefined set of tasks.
Model training and evaluation. We select and train both ML and DL models on the training and validation sets. We evaluate model performance on the test set.
The script run.py in the icu_benchmarks folder unites three stages of the pipeline which can be used individually: Preprocessing (Section B.1), Task implementation (Section B.2), and Evaluation (Section B.3). Each stage has an associated help function. Below we describe each stage in detail.
The method preprocess unites all necessary steps to generate the input for the Deep Learning (DL) and Machine Learning (ML) models:
run_merge_step — Converts ICU data from table format to a matrix format that can be input to the machine learning model (see details in Appendix A: Dataset Details).
run_resample_step — Re-samples data to a regularly sampled 5min resolution.
run_feature_extraction_step — Hand-engineered feature extraction for classical machine learning models.
run_build_dl — Data splitting and preprocessing specific to ML.
Running these steps requires the following arguments:
hirid-data-root — Path to the unpacked parquet files containing the HiRID data, downloaded from Physionet.
work-dir — Path to the working directory of the user.
var-ref-path — Path to the file containing meta-data about variables necessary for our pre-processing pipeline.
split-path — Path to the exact split of the data for model training and evaluation.
nr-workers — Number of workers to use during training. It depends on the user hardware capacity.
To run the whole preprocessing, the command below is used:
The second stage of the pipeline, consisting of extracting task labels, is constructed around the following three steps:
imputation_for_endpoints — Imputes data for endpoint generation.
generate_endpoints — Generates the endpoints i.e. verify if conditions for events are matched at each timestep.
generate_labels — Extract final labels from endpoints.
As mentioned in the paper, we construct a set of 6 tasks, relevant for healthcare workers:
Mortality_At24Hours — Mortality prediction at 24h after admission in the ICU (further Mortality).
Phenotyping_APACHEGroup — Patient APACHE group classification 24h after admission in the ICU (further Patient Phenotyping).
Dynamic_CircFailure_12Hours — Continuous prediction of occurrence of circulatory failure in the next 12 hours (further Circulatory Failure).
Dynamic_RespFailure_12Hour — Continuous prediction of occurrence of respiratory failure in the next 12 hours (further Respiratory Failure).
Remaining_LOS_Reg — Continuous prediction of the remaining stay duration (further Remaining LOS).
Dynamic_UrineOutput_2Hours_Reg — Continuous prediction of patient urine production in the next 2h (further Kidney Function).
Model Training and Evaluation
In the final part of the pipeline, we provide the user with a choice of model to train:
from the list of DL models: Transformer, LSTM, GRU and Temporal CNN. All these models use the class DLWrapper defining methods mechanics for deep learning models’ training and evaluation.
from the list of the ML Models: Gradient Boosting method and Logistic Regression. In the same manner as DL models, these models have a MLWrapper class.
All models are trained on the training set. The validation set is used for early stopping and model selection through a random search. After the training procedure, the model performance is evaluated on the test set.
To run the training, the following command is used (as an example, we provide here the command for GRU model training for the Dynamic_CircFailure_12Hours task):
Argument -l specifies the path to the directory where the trained model and meta-data will be stored.
Argument -t specifies the selected task.
Argument -sd specifies the random seed.
Argument -c specifies path to gin-config configuration file. This file should include the path to the data after preprocessing, the task, the model with all hyper-parameters.
An example of the configuration file for the GRU model is provided below131313See examples of configuration files for all the tasks and models in the config folder of the software repository.