Multiple organ failure (MOF) is a clinical syndrome with variable causes, including pathogens, and complicated pathogenesis, and it is a major cause of mortality and morbidity for trauma patients admitted to Intensive Care Units (ICUs). Recent studies of ICU trauma patients report that a substantial fraction develop MOF, and that MOF markedly increases the overall risk of death compared to patients without MOF. To prevent MOF in trauma patients from progressing to an irreversible stage, it is essential to diagnose MOF early and effectively. Many scoring systems have been proposed to predict MOF [5, 6, 7, 8], and researchers have attempted to predict MOF in trauma patients at an early phase using predictive models [9, 10].
The rapid growth of data availability in clinical medicine requires doctors to handle extensive amounts of data. As medical technologies become more complex, technological advances like machine learning (ML) are increasingly needed to improve real-time analysis and interpretation of the results [11]. In recent years, practical uses of ML in healthcare have grown tremendously, including cancer diagnosis and prediction [12, 13, 14], tumor detection [15, 16], medical image analysis [17], and health monitoring [18, 19].
Compared to traditional medical care, ML-assisted clinical decision support enables a more standardized process for interpreting complex multi-modality data. In the long term, ML can provide an objective viewpoint that helps clinical practitioners improve performance and efficiency [20]. However, ML is often referred to as a black box: the input data and output decisions are explicit, but the intermediate learning process is opaque. Additionally, in medical domains there is no universal rule for selecting the best configuration to achieve the optimal outcome. Moreover, medical data has its own challenges, such as numerous missing values [21] and collinear variables [22]. Thus, it is difficult to process the data and choose the proper model and corresponding parameters, even for an ML expert. Furthermore, a detailed quantitative analysis of the potential impacts of different ML system settings on MOF prediction has been missing.
In this paper, we experiment with comprehensive ML settings for the prediction of MOF, considering different dimensions from data preprocessing (missing value treatment, label balancing, feature scaling), feature selection, and classifier choice, to hyperparameter tuning. To predict MOF for trauma patients at an early stage, we use only initial time measurements (hour 0) as inputs. We mainly use the area under the receiver operating characteristic curve (AUC) to evaluate MOF prediction outcomes. We focus on analyzing the relationships among configuration complexity, predicted performance, and performance variation. Additionally, we quantify the relative impacts of each dimension.
The main contributions of this paper include:
To the best of our knowledge, this is the first paper to conduct a thorough empirical analysis quantifying the predictive performance with exhaustive ML configurations for MOF prediction.
We provide general guidance for ML practitioners in healthcare and medical fields through quantitative analysis of different dimensions commonly used in ML tasks.
Experimental results indicate that classifier choice contributes most to both performance improvement and variation. Complex classifiers, including ensemble methods, bring higher default/optimized performance on average, along with a higher risk of inferior performance compared to simple ones.
The remainder of this paper is organized as follows. Section 2 describes the dataset and features we use. All of the ML configurations are available in Section 3. Experimental results are discussed in Section 4. Finally, our conclusions are presented in Section 5.
Our dataset, collected from the San Francisco General Hospital and Trauma Center, contains highest-level trauma activation patients evaluated at this level I trauma center. Due to the urgency of medical treatment, there are numerous missing values for time-dependent measurements. Thus, we consider only those features whose missing value percentage over all patients does not exceed a fixed threshold. To obtain a timely prediction, early lab measurements (hour 0) as well as patients’ demographic and illness information were extracted as the set of features. Detailed feature statistics are available in Table 1.
| Category | Features |
|---|---|
| Demographic | gender, age, weight, race, blood type |
| Illness | comorbidities, drug usage |
Our target variable consists of binary class labels (0 for no MOF and 1 for MOF). The data with feature and target variables is then randomly split into training and testing sets.
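The split can be sketched with scikit-learn; the placeholder feature matrix, the 25% test fraction, and the random seed below are illustrative assumptions, not the paper’s exact settings.

```python
# Hedged sketch: stratified train/test split for an imbalanced binary target.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # placeholder feature matrix
y = (rng.random(200) < 0.2).astype(int)  # imbalanced binary target (1 = MOF)

# Stratify so that the MOF/no-MOF ratio is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
```

Stratification matters here because a plain random split of a rare-outcome dataset can leave very few positive samples in the test set.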
Based on ML pipelines and special characteristics of our data such as large number of missing values and varying scales in feature values, we consider comprehensive ML configurations from the following dimensions: data preprocessing (missing value treatment (MV), label balancing (LB), feature scaling (SCALE)), feature selection (FS), classifier choice (CC), and hyperparameter tuning (HT). In the remainder of the paper, we will interchangeably use the full name and corresponding abbreviations shown in parentheses. Further details on each dimension are described below.
3.1 Data Preprocessing
Methods to handle the dataset with missing values, imbalanced labels, and unscaled variables are essential for the data preprocessing process. We use several different methods to deal with each of these problems.
3.1.1 Missing Value Treatment
In our dataset, numerous time-dependent features cannot be recorded on a timely basis, and missing data is a serious issue. We consider three different ways to deal with missing values, where the first method serves as the baseline setting for MV, and the latter two methods are common techniques of missing value imputation in ML.
Remove all patients with any missing values for the features listed in Section 2.
Replace missing values with mean for numerical features and mode for categorical features over all patients.
Impute missing values by finding the k-nearest neighbors with the Euclidean distance metric for each feature respectively.
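The latter two imputation strategies can be sketched with scikit-learn; the toy matrix and the choice of two neighbors below are illustrative assumptions.

```python
# Hedged sketch of mean imputation and k-nearest-neighbor imputation.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Toy numerical data: row 1 is missing its first feature.
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [3.0, 4.0],
              [5.0, 8.0]])

# Mean imputation for numerical features
# (categorical features would use strategy="most_frequent").
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill from the 2 nearest rows under (nan-aware) Euclidean distance.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)
```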
3.1.2 Label Balancing
Our dataset is imbalanced, as the sample class ratio between class 0 and class 1 is heavily skewed. Keeping the imbalanced class labels serves as the baseline setting for LB. Three different ways are considered to resample the training set.
Oversampling the minority class (label 1)
Method: SMOTE (synthetic minority over-sampling technique) [23].
Explanation: choose the k-nearest neighbors of every minority sample and then create new samples halfway between the original sample and its neighbors.
Undersampling the majority class (label 0)
Method: NearMiss [24].
Explanation: when samples of both classes are close to each other, remove the samples of the majority class to provide more space for both classes.
Combination of oversampling and undersampling
Method: SMOTE & Tomek link [25].
Tomek link: a pair of samples that are each other’s nearest neighbors but come from different classes.
Explanation: first create new samples for the minority class and then remove the majority class sample in any Tomek link.
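In practice these resamplers come from the imbalanced-learn package; as a minimal self-contained sketch, the “halfway” interpolation described above can be written directly (real SMOTE picks a random point along the segment rather than the midpoint, and the minority matrix below is a toy example).

```python
# Simplified SMOTE-style oversampling: one synthetic sample per minority point,
# placed halfway toward one of its nearest minority neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def oversample_halfway(X_min, k=2):
    """Create one synthetic sample per minority point, halfway to a neighbor."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: each point is its own nearest
    _, idx = nn.kneighbors(X_min)
    neighbors = X_min[idx[:, 1]]          # first true neighbor of each point
    return (X_min + neighbors) / 2.0      # halfway interpolation

X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X_synth = oversample_halfway(X_minority, k=2)
```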
3.1.3 Feature Scaling
Since the range of feature values in our dataset varies widely, we perform feature scaling. No scaling on any feature serves as the baseline setting for SCALE. Two common scaling techniques are used for numerical features.
Normalization: rescale values to the range between 0 and 1.
Standardization: rescale values to have zero mean and unit variance.
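Both scaling options map directly onto scikit-learn transformers; the single-column data below is purely illustrative.

```python
# Hedged sketch of the two feature scaling techniques.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])

X_norm = MinMaxScaler().fit_transform(X)   # normalization: values in [0, 1]
X_std = StandardScaler().fit_transform(X)  # standardization: zero mean, unit variance
```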
3.2 Feature Selection
In medical datasets, there usually exist many highly correlated features, as well as some features that are only weakly correlated with the target [22, 26]. Thus, it is essential to identify the most relevant features, which may help to improve the outcome of the analysis. Using all of the features described in Section 2 serves as the baseline setting for FS. We consider two main feature selection techniques: filter and wrapper methods.
Filter-based methods (independent of classifiers):
Use the correlation between features and the target to select features that are highly dependent on the target.
Filter out numerical features using the ANOVA F-test and categorical features using the chi-squared test.
Wrapper-based methods (dependent on classifiers):
Method: RFE (recursive feature elimination) in random forest.
Explanation: perform RFE repeatedly such that features are ranked by importance, and the least important features are disregarded until a specific number of features remains.
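Both selection styles can be sketched with scikit-learn; the synthetic data and the choice of keeping five features are arbitrary assumptions for illustration.

```python
# Hedged sketch of filter-based (ANOVA F-test) and wrapper-based (RFE) selection.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=10, n_informative=3,
                           random_state=0)

# Filter: keep the 5 features most associated with the target (classifier-independent).
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper: recursively drop the least important features of a random forest.
rfe = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
          n_features_to_select=5)
X_wrap = rfe.fit_transform(X, y)
```

A chi-squared filter would be applied analogously to the (non-negative, e.g. one-hot encoded) categorical features.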
3.3 Classifier Choice
We experimented with a broad set of classifiers on the dataset. In general, these classifiers can be divided into two main categories: single and ensemble. A list of all classifiers is available in Table 2. For ensemble classifiers (combinations of individual classifiers), we tried bagging (BAG, RF, ET), boosting (GB, ABC, XGB, LGBM), voting (VOTE), and stacking (STACK). In bagging, DT is a homogeneous weak learner: multiple DTs learn the dataset independently of each other in parallel, and the final outcome is obtained by averaging the results of each DT. In boosting, DT also serves as a homogeneous weak learner; however, the DTs learn the dataset sequentially in an adaptive way (each new learner depends on the previous learners’ success), and the final outcome is determined by a weighted sum of the learners. In voting, heterogeneous base estimators (LR, RF, SVM, MLP, ET) are considered, where each estimator learns the original dataset and the final prediction is determined by majority voting. In stacking, several heterogeneous base learners (RF, KNN, SVM) learn the dataset in parallel, and a meta-learner (LR) combines the predictions of the weak learners. The abbreviations shown in parentheses for voting and stacking are the ones we use.
| Single classifiers | Ensemble classifiers |
|---|---|
| LR, SVM, KNN, MLP, DT | BAG, RF, ET, GB, ABC, XGB, LGBM, VOTE, STACK |
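The voting and stacking constructions above can be sketched with scikit-learn; for brevity this sketch uses smaller base-learner sets than the paper’s (LR/RF/SVM for voting, RF/KNN for stacking) and synthetic data.

```python
# Hedged sketch of heterogeneous voting and stacking ensembles.
from sklearn.datasets import make_classification
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, n_features=8, random_state=0)

# Voting: each estimator learns the data; final prediction by majority ("hard") vote.
vote = VotingClassifier(
    [("lr", LogisticRegression(max_iter=1000)),
     ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
     ("svm", SVC())],
    voting="hard")
vote.fit(X, y)

# Stacking: base learners fit in parallel; an LR meta-learner combines their predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X, y)
```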
3.4 Hyperparameter Tuning
Hyperparameters are crucial for controlling the overall behavior of classifiers. Default hyperparameters of classifiers serve as the baseline setting for HT. We apply grid search to perform hyperparameter tuning for all classifiers. Detailed information about tuned hyperparameters is available in Table 3.
| Classifier | Tuned hyperparameters |
|---|---|
| LR | C, class_weight, penalty |
| SVM | C, gamma, kernel, class_weight |
| KNN | n_neighbors, weights, algorithm |
| MLP | activation, solver, alpha |
| ABC | base_estimator, n_estimators, learning_rate |
| LGBM | num_leaves, colsample_bytree, subsample, max_depth |
| VOTE | C (SVM), n_estimators (ET) |
| STACK | C (SVM), n_neighbors (KNN) |
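Grid search over one of these hyperparameter sets can be sketched as follows; the grid values below are illustrative, not the paper’s exact search space.

```python
# Hedged sketch: grid search over a small SVM hyperparameter grid, scored by AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=6, random_state=0)

grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), grid, scoring="roc_auc", cv=3)
search.fit(X, y)   # exhaustively evaluates every (C, kernel) combination
best = search.best_params_
```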
4 Experiments and Results
We formulated MOF prediction as a binary classification task. All of the experiments in this paper were implemented using scikit-learn [27]. As mentioned in Section 2, our dataset is randomly split into training and testing sets. One-hot encoding is applied to all categorical features. For each classifier, we use the same training and testing sets. We use AUC as our main performance metric, as it is commonly used for MOF prediction in the literature [6, 28, 29]. It provides a “summary” of classifier performance compared to single metrics such as precision and recall. AUC represents the probability that a classifier ranks a randomly chosen positive sample (class 1) higher than a randomly chosen negative sample (class 0), and is thus useful for imbalanced datasets. In this section, we quantify the impacts (improvement and variation) of each dimension on the predicted performance over our testing dataset.
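The ranking interpretation of AUC can be checked directly on a toy example; the labels and scores below are illustrative.

```python
# Hedged sketch: AUC equals the fraction of (positive, negative) pairs
# in which the positive sample receives the higher score.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # classifier scores

auc = roc_auc_score(y_true, y_score)

# Pairwise check: fraction of (positive, negative) pairs ranked correctly.
pairs = [(p, n) for p in y_score[y_true == 1] for n in y_score[y_true == 0]]
manual = np.mean([p > n for p, n in pairs])
```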
4.1 Influence of Individual Dimensions
First, we evaluate how much each dimension contributes to the AUC score improvement and variation respectively, and find the correlation between performance improvement and variation over all dimensions.
4.1.1 Performance Improvement across Dimensions
For HT, MV, LB, SCALE, and FS, we define the baseline as default hyperparameter choices, no missing value imputation, no label balancing, no feature scaling, and no feature selection, respectively. For CC, we choose SVM, which achieves the median score among all classifiers, as the baseline. We then quantify the performance improvement of each dimension. Fig. 1 shows the percentage that each dimension contributes to the improvement in the AUC score over the baseline when tuning only one dimension at a time while leaving the others at baseline settings. We observe that CC contributes most to the performance improvement for MOF prediction. After CC, LB, FS, MV, HT, and SCALE bring decreasing degrees of performance improvement in the AUC score.
Table 4 shows the improvement from every single dimension for each classifier over the baseline. In general, MV and LB tend to provide the greatest performance improvement for most classifiers. For RF, ET, and LGBM, FS contributes the most to the improvement in performance, since these classifiers rank feature importance intrinsically, and external FS improves their prediction outcomes to a large extent. Note that the classifier for which SCALE has the largest impact is KNN, as it is a distance-based classifier that is sensitive to the range of feature values. Also, because DT is unstable and prone to overfitting, HT is most critical for its improvement.
| Classifier | MV (%) | LB (%) | SCALE (%) | FS (%) | HT (%) |
|---|---|---|---|---|---|
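The one-dimension-at-a-time ablation behind these tables reduces to a simple computation; the AUC numbers below are made up for illustration.

```python
# Hedged sketch: percentage improvement of each dimension over the all-baseline AUC
# when tuning only that dimension (illustrative scores, not the paper's results).
baseline_auc = 0.70
best_auc_per_dim = {"MV": 0.74, "LB": 0.76, "SCALE": 0.71, "FS": 0.75, "HT": 0.72}

improvement = {dim: 100 * (auc - baseline_auc) / baseline_auc
               for dim, auc in best_auc_per_dim.items()}
```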
In addition to AUC, other performance metrics are used to measure the degree of performance improvement for each dimension. The results in Table 5 reveal that CC brings the greatest improvement regardless of the metric we use. Contributions from HT and SCALE are relatively small compared to the other dimensions.
4.1.2 Performance Variation across Dimensions
For all of the ML configurations, we further investigate how much each dimension contributes to the performance variation in the AUC score. By tuning only one dimension at a time while leaving the other dimensions at baseline settings, we obtain a range of AUC scores. Performance variation is the difference between the maximum and the minimum score for each dimension. Fig. 2 shows the proportion of the performance variation in the AUC score attributable to each dimension. Based on Fig. 2, we notice that CC, which brings the largest performance improvement, also brings the largest performance variation. After CC, LB, FS, MV, HT, and SCALE bring decreasing degrees of performance variation in the AUC score.
Table 6 shows the variation of every single dimension on each classifier over the baseline. We observe that, for each classifier, if one dimension brings a larger performance improvement, it also results in a larger performance variation. Using the same metrics as above, Table 7 shows that the proportion of performance variation in different metrics from each dimension follows an order consistent with the performance improvement in Table 5. Thus, for different metrics, greater improvement brings greater variation for each dimension. For every step that researchers take when predicting MOF using ML, they should be aware of the trade-off between benefits (improvement in performance) and risks (variation in performance) when adjusting each dimension.
| Classifier | MV (%) | LB (%) | SCALE (%) | FS (%) | HT (%) |
|---|---|---|---|---|---|
4.2 Performance Comparison across Classifiers
We have shown that classifier choice is the largest contributor to both performance improvement and variation in the AUC score. Hence, we further investigate the performance differences among classifiers. Specifically, we investigate the relationships among classifier complexity, performance, and performance variation.
4.2.1 Default versus Optimized Performance
Default classifiers are defined as classifiers with default parameters, while optimized classifiers are those for which hyperparameter tuning with k-fold cross-validation is applied using grid search. We compare the performance of default and optimized classifiers in consideration of all other dimensions, i.e., MV, LB, SCALE, and FS. The average AUC scores of all classifiers with default and optimized settings are shown in Fig. 3. In general, ensemble classifiers perform better than single classifiers with both default and optimized settings.
In addition to AUC, other performance metrics are used to evaluate the performance of all classifiers. We use the median score to rank classifiers with both default and optimized settings. Then, NDCG (normalized discounted cumulative gain), one of the most prevalent measures of ranking quality [30], is used to compare classifier rankings between each of these metrics and the AUC score. Detailed relevance scores are shown in Table 8. The result indicates that the ranking by median performance is similar no matter which metric is used. This also suggests that the AUC score can represent classifiers’ overall performance well.
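The ranking comparison can be sketched with scikit-learn’s `ndcg_score`; the relevance values and alternative-metric scores below are made up for illustration.

```python
# Hedged sketch: NDCG agreement between an AUC-based classifier ranking and the
# ranking induced by another metric (illustrative numbers).
import numpy as np
from sklearn.metrics import ndcg_score

# Relevance of each classifier under the AUC ranking (higher = better), and the
# scores a hypothetical alternative metric assigns to the same classifiers.
auc_relevance = np.array([[5, 4, 3, 2, 1]])
other_metric_scores = np.array([[0.90, 0.85, 0.80, 0.82, 0.60]])

ndcg = ndcg_score(auc_relevance, other_metric_scores)  # 1.0 = identical ranking
```

Here the alternative metric swaps two mid-ranked classifiers, so NDCG is high but strictly below 1.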
Based on the above experiments, ensemble classifiers should be prioritized in MOF prediction since they usually bring better predictive performance than single classifiers.
| Default (%) | Optimized (%) |
|---|---|
4.2.2 Performance Variation across Classifiers
We measure the performance variation for each classifier in consideration of all other dimensions, i.e., MV, LB, SCALE, FS, and HT. For each classifier, we obtain a range of AUC scores; the size of this range determines the extent of the performance variation. Fig. 4 shows the performance variation in the AUC score for all classifiers. The order of the classifiers on the x-axis is based on increasing model complexity, measured by classifier training time with default settings. Classifier complexity and performance variation demonstrate an evident ‘U-shaped’ relationship. When a classifier is ‘too simple’, its performance variation is relatively large. When the complexity of the classifier is ‘appropriate’, the performance variation is relatively small. If the classifier becomes ‘too complex’, it again runs the risk of larger performance variation. Therefore, classifiers with ‘appropriate’ complexity are more stable, with smaller changes in performance, while ‘too simple’ or ‘too complex’ classifiers are relatively unstable, with larger changes in performance in general.
In addition to AUC, the same metrics as above were used to validate the performance variation of all of the classifiers. We use the range (difference between maximum and minimum scores) to rank classifiers in consideration of MV, LB, SCALE, FS, and HT. Then, NDCG is used to compare classifier rankings between each of these metrics and the AUC score. Table 9 displays detailed relevance scores. The result suggests that the other metrics show a ‘U-shaped’ relationship between classifier complexity and performance variation similar to that of the AUC score. When predicting MOF, it is inadvisable for clinical practitioners to choose ‘too simple’ or ‘too complex’ classifiers, since these run the risk of underfitting and overfitting, respectively.
We have provided a timely MOF prediction using early lab measurements (hour 0) together with patients’ demographic and illness information. Our study quantitatively analyzes performance via the AUC score across a wide range of ML configurations for MOF prediction, with a focus on the correlations among configuration complexity, predicted performance, and performance variation. Our results indicate that choosing the right classifier is the most crucial step, with the largest impact (performance and variation) on the outcome. More complex classifiers, including ensemble methods, can provide better default/optimized performance, but without careful selection may also lead to larger performance degradation. Clearly, more MOF data is needed to reach a more general conclusion. Our work can potentially serve as a practical guide for ML practitioners conducting data analysis in healthcare and medical fields.
This work was funded by the National Institutes of Health (NIH) grant R01-HL149670.
-  Harjola, V.P., Mullens, W., Banaszewski, M., et al.: Organ dysfunction, injury and failure in acute heart failure: from pathophysiology to diagnosis and management. A review on behalf of the Acute Heart Failure Committee of the Heart Failure Association (HFA) of the European Society of Cardiology (ESC). European Journal of Heart Failure, vol. 19, pp. 821-836 (2017).
-  Wang, Z.K., Chen, R.J., Wang, S.L., et al.: Clinical application of a novel diagnostic scheme including pancreatic β-cell dysfunction for traumatic multiple organ dysfunction syndrome. Molecular Medicine Reports, vol. 17, pp. 683-693 (2018).
-  Durham, R. M., Moran, J. J., Mazuski, J. E., et al.: Multiple organ failure in trauma patients. Journal of Trauma and Acute Care Surgery, vol. 55, pp. 608-616 (2003).
-  Ulvik, A., Kvåle, R., Wentzel-Larsen, T., et al.: Multiple organ failure after trauma affects even long-term survival and functional status. Critical Care, vol. 11, pp. R95 (2007).
-  Barie, P.S., Hydo, L.J., Fischer, E.: A prospective comparison of two multiple organ dysfunction/failure scoring systems for prediction of mortality in critical surgical illness. The Journal of Trauma, vol. 37, pp. 660-666 (1994).
-  Bota, D.P., Melot, C., Ferreira, F.L., et al.: The Multiple Organ Dysfunction Score (MODS) versus the Sequential Organ Failure Assessment (SOFA) score in outcome prediction. Intensive Care Medicine, vol. 28, pp. 1619-1624 (2002).
-  Dewar, D.C., White, A., Attia, J., et al.: Comparison of postinjury multiple-organ failure scoring systems. Journal of Trauma and Acute Care Surgery, vol. 77, pp. 624-629 (2014).
-  Hutchings, L., Watkinson, P., Young, J.D., et al.: Defining multiple organ failure after major trauma. Journal of Trauma and Acute Care Surgery, vol. 82, pp. 534-541 (2017).
-  Sauaia, A., Moore, F.A., Moore, E.E., et al.: Multiple Organ Failure Can Be Predicted as Early as 12 Hours after Injury. Journal of Trauma and Acute Care Surgery, vol. 45, pp. 291-303 (1998).
-  Vogel, J.A., Liao, M.M., Hopkins, E., et al.: Prediction of postinjury multiple-organ failure in the emergency department. Journal of Trauma and Acute Care Surgery, vol. 76, pp. 140-145 (2014).
-  Obermeyer, Z., Emanuel, E.J.: Predicting the Future — Big Data, Machine Learning, and Clinical Medicine. New England Journal of Medicine, vol. 375, pp. 1216-1219 (2016).
-  Cruz, J.A., Wishart, D.S.: Applications of Machine Learning in Cancer Prediction and Prognosis. Cancer Informatics, vol. 2, pp. 117693510600200030 (2006).
-  Kourou, K., Exarchos, T.P., Exarchos, K.P., et al.: Machine learning applications in cancer prognosis and prediction. Computational and Structural Biotechnology Journal, vol. 13, pp. 8-17 (2015).
-  Asri, H., Mousannif, H., Al Moatassime, H., et al.: Using Machine Learning Algorithms for Breast Cancer Risk Prediction and Diagnosis. Procedia Computer Science, vol. 83, pp. 1064-1069 (2016).
-  Sharma, K., Kaur, A., Gujral, S.: Brain Tumor Detection based on Machine Learning Algorithms. International Journal of Computer Applications, vol. 103, pp. 7-11 (2014).
-  Wang, Z., Yu, G., Kang, Y., et al.: Breast tumor detection in digital mammography based on extreme learning machine. Neurocomputing, vol. 128, pp. 175-184 (2014).
-  De Bruijne, M.: Machine learning approaches in medical image analysis: From detection to diagnosis. Medical Image Analysis, vol. 33, pp. 94-97 (2016).
-  Farrar, C.R., Worden, K.: Structural health monitoring: a machine learning perspective, https://onlinelibrary.wiley.com/doi/book/10.1002/9781118443118 (2013).
-  Worden, K., Manson, G.: The application of machine learning to structural health monitoring. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 365, pp. 515-537 (2007).
-  Ahmed, Z., Mohamed, K., Zeeshan, S., et al.: Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine. Database, vol. 2020 (2020).
-  Janssen, K.J., Donders, A.R.T., Harrell Jr, F.E., et al.: Missing covariate data in medical research: To impute is better than to ignore. Journal of Clinical Epidemiology, vol. 63, pp. 721-727 (2010).
-  Tuba, E., Strumberger, I., Bezdan, T., et al.: Classification and Feature Selection Method for Medical Datasets by Brain Storm Optimization Algorithm and Support Vector Machine. Procedia Computer Science, vol. 162, pp. 307-315 (2019).
-  Chawla, N.V., Bowyer, K.W., Hall, L.O., et al.: SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, vol. 16, pp. 321-357 (2002).
-  Mani, I., Zhang, I.: kNN approach to unbalanced data distributions: a case study involving information extraction. In: Proceedings of workshop on learning from imbalanced datasets, vol. 126 (2003).
-  Batista, G.E., Bazzan, A.L., Monard, M.C.: Balancing Training Data for Automated Annotation of Keywords: a Case Study. In: WOB, pp. 10-18 (2003).
-  Dağ, H., Sayin, K.E., Yenidoğan, I., et al.: Comparison of feature selection algorithms for medical data. In: 2012 International Symposium on Innovations in Intelligent Systems and Applications, pp. 1-5 (2012).
-  Pedregosa, F., Varoquaux, G., Gramfort, A., et al.: Scikit-learn: Machine learning in Python. the Journal of machine Learning research, vol. 12, pp. 2825-2830 (2011).
-  Bakker, J., Gris, P., Coffernils, M., et al.: Serial blood lactate levels can predict the development of multiple organ failure following septic shock. The American Journal of Surgery, vol. 171, pp. 221-226 (1996).
-  Papachristou, G.I., Muddana, V., Yadav, D., et al.: Comparison of BISAP, Ranson’s, APACHE-II, and CTSI Scores in Predicting Organ Failure, Complications, and Mortality in Acute Pancreatitis. American Journal of Gastroenterology, vol. 105, pp. 435-441 (2010).
-  Chen, W., Liu, T.Y., Lan, Y., et al.: Ranking measures and loss functions in learning to rank. In: Advances in Neural Information Processing Systems, vol. 22, pp. 315-323 (2009).