Semantically Enhanced Dynamic Bayesian Network for Detecting Sepsis Mortality Risk in ICU Patients with Infection

by Tony Wang, et al.

Although timely sepsis diagnosis and prompt interventions in Intensive Care Unit (ICU) patients are associated with reduced mortality, early clinical recognition is frequently impeded by non-specific signs of infection and failure to detect signs of sepsis-induced organ dysfunction in a constellation of dynamically changing physiological data. The goal of this work is to identify patients at risk of life-threatening sepsis utilizing a data-centered and machine learning-driven approach. We derive a mortality risk predictive dynamic Bayesian network (DBN) guided by a customized sepsis knowledgebase and compare the predictive accuracy of the derived DBN with the Sepsis-related Organ Failure Assessment (SOFA) score, the Quick SOFA (qSOFA) score, the Simplified Acute Physiological Score (SAPS-II) and the Modified Early Warning Score (MEWS) tools. A customized sepsis ontology was used to derive the DBN node structure and semantically characterize temporal features derived from both structured physiological data and unstructured clinical notes. We assessed the performance in predicting mortality risk of the DBN predictive model and compared performance to other models using Receiver Operating Characteristic (ROC) curves, area under the curve (AUROC), calibration curves, and risk distributions. The derived dataset consists of 24,506 ICU stays from 19,623 patients with evidence of suspected infection, with 2,829 patients deceased at discharge. The DBN AUROC was found to be 0.91, which outperformed the SOFA (0.843), qSOFA (0.66), MEWS (0.73), and SAPS-II (0.77) scoring tools. Continuous Net Reclassification Index and Integrated Discrimination Improvement analysis supported the superiority of the DBN. Compared with conventional rule-based risk scoring tools, the sepsis knowledgebase-driven DBN algorithm offers improved performance for predicting mortality of infected patients in ICUs.







In the U.S., up to 52% of all hospital deaths, typically in the ICU, involve sepsis[1]. Notably, patients presenting with initially less severe sepsis account for a majority of these sepsis deaths[2].

Given this prevalence, predictive tools that identify at-risk patients during early stages of disease progression could drive important reductions in hospital mortality. Studies indicate that patients often have detectable signatures of physiological decompensation or deterioration in monitoring data hours before events such as septic shock or unexpected death[3, 4, 5, 6].

At the same time, the clinical manifestations of sepsis are highly dynamic, depending on the initial site of infection, the causative organism, the pattern of acute organ dysfunction, the underlying health status of the patient, and the interval before initiation of treatment[7]. Physiological data tracking tools with excellent predictive value that assist in the timely identification of patients at imminent risk (within hours of an event) may lead to improved outcomes[8]. Each hour of delay in the administration of recommended therapy is associated with a linear increase in the risk of mortality[9, 10].

Tracking tools can also help the treatment team strike an effective balance between judicious utilization of limited, high-cost monitoring resources and provision of the highest intensity of care to optimize the patient’s survival[11, 12].

Sepsis Mortality Risk Prediction Tools

The 2016 Third International Consensus Definitions for Sepsis Task Force defined sepsis as a “life-threatening organ dysfunction due to a dysregulated host response to infection”[13]. Using a validation cohort of 7,932 ICU encounters with signs of infection and associated organ dysfunction and mortality observations (16%), the Task Force retrospectively established the predictive validity of a change of two or more points in the SOFA score (AUROC = 0.74; 95% CI, 0.73-0.76) for the identification of patients at risk of hospital mortality from sepsis, using patient physiological data from up to 48 hours before to up to 24 hours after the onset of infection[14].

A qSOFA score, a simple surrogate for organ dysfunction in cases lacking adequate SOFA physiological data, was similarly derived and validated (AUROC = 0.66; 95% CI, 0.64-0.68), and compared with the Systemic Inflammatory Response Syndrome (SIRS) criteria (AUROC = 0.64; 95% CI, 0.62-0.66)[14]. Recent studies indicate that the use of two-point SIRS criteria or two-point qSOFA criteria for identifying at-risk ICU patients is not as effective as SOFA, thus limiting their use for predicting mortality in this setting[15, 16, 17].

ICU Mortality Risk Prediction Tools

A number of ICU patient scoring systems for assessing disease severity that predict mortality outcomes have been developed and used for standardizing research and for comparing the quality of patient care across ICUs. In addition to SOFA[18], these tools include the SAPS[19], Acute Physiologic and Chronic Health Evaluation (APACHE)[20], and the Mortality Prediction Model (MPM)[21].

No single instrument has convincing or proven superiority over another in its ability to predict death[22, 23, 24]. For comparative purposes in this study, beyond SOFA, we have chosen to use SAPS-II, which, compared to APACHE, conveniently predicts mortality via a closed-form equation. We also included the relatively simple Modified Early Warning Score (MEWS)[25], initially proposed as a scoring method for emergency departments, which has been used for predicting length of stay and mortality in the ICU setting[26].

Machine Learning (ML)

In addition to individualized patient rule-based analysis of clinical data, modern EMRs include the ability to analyze large repositories of institutional multi-patient longitudinal data with powerful ML algorithms that can learn complex patterns in physiological data without being explicitly programmed, unlike simpler rule-based systems. Most importantly, these algorithms can be trained to recognize and classify early patterns in large data repositories to identify with high precision hospitalized patients at risk of sepsis mortality early enough to intervene effectively before clinical deterioration[27, 28, 29, 30].

Generalized linear models (i.e., logistic regression, Cox regression), characterized as having high explanatory power, are the most common algorithms used to develop risk prediction models from EMR data[31]. Commonly used linear ML methods in sepsis predictive analytics include multivariate linear regression[32], logistic regression[33, 34, 17, 6], and linear support vector machine (SVM)[29, 35] classifiers. However, sepsis has been characterized as a “complex systems” class problem[36], with multiple organ failures that represent a “chaotic adaptation” to severe stress[37] which cannot be explained by linear statistical systems[38]. The recognized complexity of the disease, where actual covariate interrelationships can be complex and non-linear, may explain the modest predictive performance of tools derived using logistic regression.

With the tremendous growth in the data science field, many nonlinear machine learning algorithms, although more “black box”, are now available. Not limited by linearity assumptions, they can explore different covariate interrelationship options for predicting mortality. Tang et al.[39] explored the use of a nonlinear support vector machine using features extracted from spectral analysis of cardiovascular monitors. Quinn et al.[40] proposed a factorial switching linear dynamic system to model patient states based on 8 physiological measurements. The 2012 PhysioNet/Computing in Cardiology Challenge aimed to tackle the problem of in-hospital mortality prediction using the MIMIC-II (Multiparameter Intelligent Monitoring in Intensive Care)[41] dataset, with 36 physiologic time series from the first 2 days of admission. Challenge participants explored various prediction algorithms including time series motifs, neural networks, and Bayesian ensembles, but showed contradictory relative predictive performance results[42, 43, 44, 45].

On the other hand, studies that have explored dynamic Bayesian networks, which model the temporal dynamics of features, have shown promise[46, 47, 48]. Recently, a “super-learner ensemble” algorithm designed to find optimal combinations of a collection of prediction algorithms was reported to achieve enhanced performance in ICU mortality prediction[49].

Dynamic Bayesian Networks (DBN)

Another key characteristic of sepsis is the rapid progression of the disease[50]. Fast and appropriate therapy is the cornerstone of sepsis treatment[51]. Despite methodological advances in ML, most methods model temporal trends independently, without capturing correlated trends manifesting underlying changes in pathophysiologic states. A recent systematic review of risk prediction models derived from EMR data identified incorporation of time-varying (longitudinal) factors as a key area of improvement in future risk prediction modelling efforts[31]. Most machine learning algorithms take standalone snapshots of numerical measurements as features, mainly because they can be easily extracted and may have robust statistical properties.

It is generally recognized that temporal trends, particularly in one or more observations reflecting a rapidly changing patient state, can be more expressive and informative than individual signs. Tools that provide bedside clinicians with information about a patient’s changing physiological status in a manner that is easy and fast to interpret may reduce the time needed to detect patients in need of life-saving interventions.

To address these challenges, we investigate DBNs which have the ability to model complex dynamics of a system by placing correlated temporal manifestations into a network. DBNs have been used to model temporal dynamics of sepsis, to predict sepsis in presenting Emergency Department patients[46], and to predict sequences of organ failures in ICU patients[47, 48].

Concept Maps (Cmap)

Our DBN approach applies an evidence-based network structure based on expert-constructed concept maps (Cmaps), a relatively recent semantic web technology[52] that employs a user-friendly approach to generating customized hierarchical ontologies in the form of concept graphs.

Cmaps are semi-formal (subject concept-predicate-object concept triples) graphical expressions of diagnostic clinical reasoning as expressed by clinicians over observed evidence such as vitals, labs, and exam findings. Cmaps formalize “deep semantic” models reflecting clinical reasoning that can be used to contextually characterize raw patient encounter clinical history, time-tagged symptoms and signs, laboratory tests, medications, procedures, orders, and other clinical data. The “structured reasoning” knowledge expressed in these graphs includes both explicit evidence-based relationships (e.g. guidelines, protocols) between concepts and implicit expertise (e.g. heuristics, atypical patterns) learned from years of medical practice. We use these maps to create computable ontologies that can automatically semantically classify raw medical record data, in effect contextually interpreting real-time data based on this expert knowledge. In the case of sepsis, these models can express concepts and relationships across risk factors, treatments, and history, and label otherwise ambiguous features such as vitals and labs, which are reflected in the DBN as causal nodes and relationships.
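As a minimal sketch (the concept and predicate names below are hypothetical, not taken from the study's ontology), a Cmap can be held as a set of subject-predicate-object triples and queried while characterizing raw data:

```python
# Hypothetical Cmap fragment as subject-predicate-object triples.
triples = {
    ("ElevatedLactate", "suggests", "TissueHypoperfusion"),
    ("TissueHypoperfusion", "isSignOf", "Sepsis"),
    ("OnMetformin", "mayExplain", "ElevatedLactate"),  # alternative, non-infectious explanation
}

def objects_of(subject, predicate):
    """Return all object concepts linked from `subject` via `predicate`."""
    return {o for s, p, o in triples if s == subject and p == predicate}

# Contextualizing an ambiguous lab finding against the graph:
implied = objects_of("ElevatedLactate", "suggests")
```

A real implementation would walk such a graph (merged with SNOMED concepts) to decide whether an abnormal value supports, or is explained away from, an infectious etiology.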

In this study, customized high-level concepts and relationships between concepts, in the form of simple “subject-predicate-object” collections of sepsis domain nodes and relationships, are enhanced with historically available medical ontologies such as the Systematized Nomenclature of Medicine (SNOMED; ©The International Health Terminology Standards Development Organisation, IHTSDO). Cmaps are used to guide the DBN node structure and the contextual meaning of raw clinical data, and to enable inductive rule-based decision support and clinical event characterizations for machine learning that closely reflect how expert clinicians would cognitively process patient data.

Fig 1 shows a partial hierarchical Cmap representing high level sepsis (Sepsis-2) evidence[53] and reasoning concepts.

Although ontologies have been used in biomedical research[54], and there is growing interest in the use of semantic models in healthcare[55], few existing applications of predictive analytics in clinical decision support leverage the power of comprehensive customized ontologies in rule engine and machine learning applications. It has also been demonstrated that using Cmaps to elicit content knowledge from physicians can significantly reduce the time and costs necessary to implement a physician’s knowledge/reasoning logic into operational systems[56]. The need to adapt clinical prediction models for application in local settings and across new problem domains is well recognized[57]. The use of semantic knowledge graphs authored by expert clinicians in machine learning can be an effective mechanism for rapid validation of new models or recalibration of existing models in new settings and/or populations.

Fig 1: Partial view of a sepsis concept map. Expert-constructed concept maps (Cmaps) representing evidence and clinician’s reasoning for Sepsis detection (partial view).

Materials and methods

Study Dataset

The data used in this study is exempt from IRB review, as Quorum Review IRB agrees that it meets the following criteria set forth in The Code of Federal Regulations, Title 45, Part 46.101(b)(4): research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects.

Clinical encounter data of adult patients (age ≥ 18 years) were extracted from the MIMIC-II version 26 ICU database[41]. MIMIC-II consists of retrospective hospital admission data of patients admitted to Beth Israel Deaconess Medical Center from 2001 to 2008. Included ICUs are medical, surgical, trauma-surgical, coronary, cardiac surgery recovery, and medical/surgical care units. Although MIMIC-II includes both time series data recorded in the EMR during encounters (e.g. vital signs, diagnostic laboratory results, free-text nursing notes and radiology reports, medications, discharge summaries, treatments, etc.) and high-resolution physiological data (time series/waveforms) recorded by bedside monitors, only the time series data recorded in the EMR was used in this study.

Patient Inclusion

Although MIMIC-II includes all patients who were admitted in the period, we focus on patients who experienced at least one ICU stay. Only patients who were determined to have had an infection or suspected infection were included in this analysis. Infection status was determined using a combination of structured fields and free-text nursing notes. We used the following criteria: (1) antibiotics and/or microbiological culture order during encounter; (2) ICD-9-CM diagnosis codes based on Sepanski et al. [58]; (3) free-text nursing notes containing text suggesting an infection or suspected infection.

To identify patients with infection in nursing notes, we developed a supervised ML algorithm utilizing a bag-of-words representation[59] and linear kernel SVMs[60]. The ML algorithm was trained on a dataset that was automatically generated using a simple heuristic. We observed that whenever there is an existing infection or a suspicion of infection, the nursing notes typically describe the fact that the patient is taking or is prescribed infection-treating antibiotics. Thus, identifying nursing notes describing the use of antibiotics will, in most cases, also identify nursing notes describing signs and symptoms of infection. (While the MIMIC database contains structured medication information, the list of medications is associated with the patient’s admission; an admission can contain numerous clinical notes and prescribed medications, which is why we were unable to use the structured medications data to construct a note-level training dataset.) Additionally, antibiotics were characterized as “prophylactic” (negating infection) based on reasoning over use patterns and patient context (e.g. pre- or post-surgery). We utilized word embeddings[61] to create a list of rules that unambiguously identify the use of infection-treating antibiotics. Word embeddings were generated utilizing all available MIMIC-III nursing notes (vector size 200, window size 7, continuous bag-of-words model). The initial set of antibiotics was then extended using the closest word embeddings in terms of cosine distance. For example, the words closest to the antibiotic amoxicillin are amox, amoxacillin, amoxycillin, cefixime, suprax, amoxcillin, and amoxicilin: misspellings, abbreviations, similar drugs, and brand names. The extended list was then manually reviewed. The final infection-treating antibiotic list consists of 402 unambiguous expressions indicating antibiotics.
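The embedding-based list expansion can be sketched as follows (the vectors below are fabricated toy values; in the study, the embeddings were trained on the nursing notes themselves):

```python
import numpy as np

# Toy embedding space (fabricated vectors) illustrating how misspellings
# and related drugs cluster near a seed antibiotic term.
vocab = {
    "amoxicillin": np.array([0.90, 0.10, 0.00]),
    "amoxacillin": np.array([0.88, 0.12, 0.02]),  # misspelling
    "cefixime":    np.array([0.70, 0.30, 0.10]),  # related antibiotic
    "ambulate":    np.array([0.00, 0.20, 0.90]),  # unrelated clinical term
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest(seed, k=2):
    """The k vocabulary terms closest to `seed` by cosine similarity."""
    sims = {w: cosine(vocab[seed], v) for w, v in vocab.items() if w != seed}
    return sorted(sims, key=sims.get, reverse=True)[:k]
```

Candidates returned by such a query were then manually reviewed before being added to the final 402-expression list.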

Antibiotics, however, are sometimes negated and are often mentioned in the context of allergies (e.g. allergic to penicillin). To distinguish between affirmed, negated, and speculated mentions of administered antibiotics, we also developed a set of rules in the form of keyword triggers. Similarly to the NegEx algorithm[62], we identified phrases that indicate uncertain or negated mentions of antibiotics that precede or follow a list of hand-crafted expressions at the sentence and clause levels.
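A minimal sketch of such trigger rules (the patterns below are illustrative, not the study's full rule set):

```python
import re

# NegEx-style keyword triggers for classifying an antibiotic mention.
NEGATION = re.compile(r"\b(no|not|without|discontinued|allergic to|allergy to)\b", re.I)
SPECULATION = re.compile(r"\b(consider|possible|if needed)\b|\?", re.I)

def classify_mention(sentence: str) -> str:
    """Label a sentence containing an antibiotic mention."""
    if NEGATION.search(sentence):
        return "negated"
    if SPECULATION.search(sentence):
        return "speculated"
    return "affirmed"
```

In the study, triggers were applied at both the sentence and clause level, so a clause-splitting step would precede this classification.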

The described approach identified 186,158 nursing notes suggesting the unambiguous presence of infection (29%) and 3,262 notes suggesting possible infection. The remaining 448,211 notes (70%) were considered to comprise our negative dataset, i.e., notes not suggesting infection. The automatically generated dataset was used to train SVMs, which achieved F1-scores ranging from 79% to 96%[60].

While we created the dataset relying on the mention of antibiotics, the ML algorithm was able to identify positive infection-describing notes that do not mention antibiotics. Depending on the infection source and the specifics of the patient history, signs and symptoms of infection can vary widely. Utilizing ML enabled us to capture a wide range of infection signs and symptoms, often (but not always) accompanied by a description of antibiotic use.

Data preparation

A DBN models a temporal process by discretizing time and building a network slice at each discretized time point on which the system’s status is represented. The DBN network structure is repeated across time slices, and nodes on temporally different slices are connected by appropriate edges. Therefore, one important DBN data preparation step is to create aligned physiological data suitable for time slices across study patients with highly diverse vitals and lab measurement/observation rates. Fortunately, MIMIC-II represents an integrated set of diverse physiological measurements and events that are time-stamped over the entire ICU stay, enabling us to align measurements on a coherent time line and to locate the appropriate interval to which every measurement belongs.

Considering that the MIMIC-II sampling rate of physiologic vitals data varies from 12 to 120 minutes with a mean and median of approximately 60 minutes, a 1-hour interval between time-slices for vitals was found feasible. Lab measurements were far less frequent. To normalize variable measurement frequency, we used the most recent readings for vitals and labs with lookback windows of 4 hours and 16 hours respectively; i.e., vitals such as systolic blood pressure (SBP) and temperature readings were considered “current” for 4 hours until a newer reading was recorded. Similarly, lab results such as white blood cell (WBC) counts were considered “current” for 16 hours. To avoid effects of sparse vitals data, we tested 4-hour, 8-hour, and 12-hour intervals for mortality predictions.

A “rolled-back” approach was applied in the selection of training data[63]: using the discharge time as the reference time of the survival/mortality event, time slices N hours prior to the event were selected for training. The rationale for applying this method is that: (1) physiological measurements close to the ICU end event may have a stronger discriminating relationship with the event than more remote (earlier) measurements; and (2) across the entire population of ICU patients, mortality is a statistically rare event. If time slices were selected chronologically forward from admission and limited to remote data, severe imbalance in the death variable could bias the model toward the majority class and significantly degrade mortality predictive performance[64].

Considered variables (shown as DBN nodes in Fig 2) on a time slice are: 1) vital signs: heart rate, respiratory rate, body temperature, systolic blood pressure, diastolic blood pressure, mean arterial pressure, oxygen saturation, urinary output; 2) laboratory tests: white blood cell count, ALT/AST, bilirubin, platelets, hemoglobin, lactate, creatinine, bicarbonate; 3) blood gas measurements: partial pressure of arterial oxygen, fraction of inspired oxygen, and partial pressure of arterial carbon dioxide; 4) Glasgow Coma Scale; and 5) indicators for non-prophylactic antibiotics and vasopressor usage.

Fig 2: Network Structure of DBN. The circular edge labeled 1 indicates the order-1 auto-regressive relationship between time slices. (Abbreviations used: ALT=Alanine Transaminase Test, AST=Aspartate Aminotransferase Test, DBP=Diastolic Blood Pressure, FiO2=Fraction of Inspired O2, GCS=Glasgow Coma Scale, INR=Prothrombin Time Test, MAP=Mean Arterial Pressure, PaCO2=Partial Pressure of Carbon Dioxide in Arterial Blood, PaO2=Partial Pressure of O2 in Arterial Blood, PlateletCnt=Platelet Count, SBP=Systolic Blood Pressure, SPO=Peripheral Capillary Oxygen Saturation, Uout=Urinary Output, WBC=White Blood Cell Count. An “m” prior to a label indicates a missing-value indicator, e.g., mFiO2=Missing Fraction of Inspired O2.)

Variables on a time slice are created by the following rules:

  1. Align all patient data by discharge timestamp, and look back 4 hours before discharge.

  2. For each vital sign, check the interval between (discharge timestamp − 4 hours) and (discharge timestamp − 8 hours) to see if any measurement was taken. A variable will be missing if no measurement was taken during the interval. If a single measurement was taken, it is treated as the value of the vital sign variable. If multiple measurements were taken during the 4-hour interval, their mean is the value of the vital sign variable.

  3. For lab data, check the interval between (discharge timestamp − 4 hours) and (discharge timestamp − 16 hours). Apply the same rules as described above: a lab variable will be missing if no measurement was taken during the interval; a single measurement is treated as the value of the lab variable; and if multiple measurements were taken during the 12-hour interval, their mean is the value of the lab variable.
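The slice-value rules above reduce to a single aggregation step per variable (a sketch, with times expressed as hours before the reference timestamp and a function name of our own choosing):

```python
# Rules for a slice value: missing if no reading falls in the window,
# the reading itself if there is one, the mean if there are several.
def slice_value(readings, window_start, window_end):
    """readings: list of (hours_before_reference, value) pairs.
    Window is [window_start, window_end) hours before the reference."""
    in_window = [v for t, v in readings if window_start <= t < window_end]
    if not in_window:
        return None                          # variable missing for this slice
    return sum(in_window) / len(in_window)   # single value or mean

# e.g. heart-rate readings 2, 5, and 7 hours before discharge;
# the vitals window for a slice spans 4-8 hours before discharge:
hr = [(2.0, 88), (5.0, 92), (7.0, 96)]
```

Here `slice_value(hr, 4, 8)` averages the two readings inside the window; labs would use a wider window per rule 3.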

To meet the normality assumption for continuous variables in a DBN, a normality transformation was applied to variables which are not normally distributed. Normalization was typically achieved using a base-10 logarithmic transformation. In cases where log-transformation was not appropriate, the Box-Cox power transformation was applied[65]. No special treatment for missing data was applied in our experiments. Bayes Server (version 7.8, Bayes Server Ltd., West Sussex, UK), the commercially available suite of ML algorithms used in our study to train and test the DBN models, supports parameter learning with missing values using the Expectation Maximization (EM) algorithm.


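The two transforms can be sketched as follows (here the Box-Cox lambda is a fixed, hand-picked value rather than the maximum-likelihood fit a statistics package would estimate):

```python
import numpy as np

# Base-10 log transform for skewed, strictly positive variables.
def log10_transform(x):
    return np.log10(np.asarray(x, dtype=float))

# Box-Cox power transform: (x^lam - 1) / lam for lam != 0, ln(x) for lam == 0.
def boxcox_transform(x, lam):
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam
```

In practice the lambda would be chosen per variable (e.g. by maximizing the log-likelihood of the transformed data), which is what standard Box-Cox implementations do.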
Vital and lab features were additionally semantically characterized using the ontology and associated rules to enhance specificity. For example, based on the finding from Lin and Haug[67], missing indicators for lactate and/or blood gas parameters were added as features to model potentially non-ignorable missingness. Additionally, abnormal lab values suggestive of sepsis-induced organ dysfunction were semantically interpreted in the context of patient chronic conditions and/or use of medications that otherwise “explained” the lab abnormalities, negating semantic assertions of an infectious etiology in the ontology.

Network Structure

The basic DBN structure is directed by Cmaps, with nodes in the DBN representing concepts in the Cmap and edges between nodes representing connecting links on the Cmap. The process may be viewed as Cmap-bridged construction of an expert system[68]. Learning an unconstrained BN from data, starting with an empty network, is an NP-hard problem; this is avoided by using Cmaps as the knowledge base[69, 70].

A Bayesian Network (BN) is a graphical model describing the statistical relationships through complex interactions and dependencies among variables[66]. A BN is composed of a Directed Acyclic Graph (DAG) structure G and a distribution parameter set Θ. A node in G corresponds to a random variable X_i, and an edge in G directed from node X_j to X_i encodes X_i's conditional dependence on X_j, with X_j being a parent of X_i. By graphical model theory, a variable is independent of its non-descendants given all of its parents in G[66]. The joint probability distribution over the variable set X = {X_1, ..., X_n} can then be decomposed by the chain rule:

P(X_1, ..., X_n) = ∏_{i=1}^{n} P(X_i | Pa(X_i))

where Pa(X_i) denotes the parents of X_i in G. The parameter set Θ specifies the parameters of each conditional distribution P(X_i | Pa(X_i)). The interpretation of Θ depends on the assumed distribution: Θ is simply a conditional probability table if the distribution is multinomial, and a mean and variance if the distribution is Gaussian.
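As a toy numeric check of this factorization (all probabilities below are fabricated), consider a three-node chain X1 → X2 → X3 over binary variables, where each node's parent set is just its predecessor:

```python
# CPTs for the chain X1 -> X2 -> X3; the joint is the chain-rule product
# P(x1, x2, x3) = P(x1) * P(x2 | x1) * P(x3 | x2).
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.4, (1, 1): 0.6}  # key: (x2, x1)
p_x3_given_x2 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # key: (x3, x2)

def joint(x1, x2, x3):
    return p_x1[x1] * p_x2_given_x1[(x2, x1)] * p_x3_given_x2[(x3, x2)]

# Because each CPT column sums to 1, the factorized joint sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```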

The temporal dynamics of vital signs and lab measurements are modeled using the DBN framework. A DBN is an extended BN used to model temporal processes[71]. In a DBN, random processes are represented by a set of nodes X_t^(i), the random variable of process i at time t. The edges within slice t indicate the relationships between these random variables, and additional edges between slices t−1 and t model the time dependencies. Usually, the process is assumed to be Markovian: any node in a DBN is only allowed to be linked from nodes in the same or the previous slice, i.e., Pa(X_t^(i)) ⊆ X_t ∪ X_{t−1}. The DBN generalizes the hidden Markov model (HMM) and state space model (SSM) by representing the hidden/observed state in terms of a set of variables[71]. The basic BN structure for body dynamics is extended hierarchically, such that key model parameters may differ among time slices to account for potential non-linearity over time, and across individuals to account for between-patient heterogeneity.

Fig 2 shows the DBN network structure of a time slice of physiological variables used for mortality prediction in this study. DBNs have the desirable property that they allow for interpretation of interactions between the different variables, and can be used to predict events far into the future[71]. As an example of feature interactions, we explored the relationship between acute respiratory distress and acute kidney injury via links between creatinine and P/F factors in the model. With forward-backward operators, a DBN can make online predictions of events on any future timeslice, similar to time-series models. In this study, we utilize this property to examine its performance in temporally remote time frames.
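A minimal sketch of that forward propagation under the first-order Markov assumption, with a fabricated two-state transition matrix (states and numbers are illustrative only, not from the study's model):

```python
import numpy as np

# Transition matrix between slices; states: 0 = stable, 1 = deteriorating.
T = np.array([[0.9, 0.1],    # P(next state | current = stable)
              [0.3, 0.7]])   # P(next state | current = deteriorating)

def predict_ahead(p0, k):
    """Distribution over states k slices ahead: repeated left-multiplication
    of the current distribution by the transition matrix."""
    p = np.asarray(p0, dtype=float)
    for _ in range(k):
        p = p @ T
    return p
```

In the full DBN, the forward operator also folds in the evidence observed on each new slice, rather than propagating a prior alone as here.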


Producing well-calibrated prediction probabilities is crucial for supporting clinical decision making. To ensure that the predicted and recorded probabilities of death coincide well, we applied the method of ensembling linear trend estimation[72] to calibrate the predicted probabilities of mortality. The method utilizes the trend filtering signal approximation approach to find the mapping from uncalibrated classification scores to calibrated probability estimates.

Performance Measures

We examine the capability of the DBN to predict ICU mortality by evaluating sensitivity, specificity, positive predictive value, negative predictive value, F1 measure, and AUROC obtained through 10-fold cross-validation, reported with 95% confidence intervals obtained through a computationally efficient influence-curve-based approach [73]. The DBN's performance is compared with SOFA, qSOFA, MEWS, and SAPS-II. For SOFA, we obtained mortality predictions by regressing the ICU death indicator on SOFA scores (main effect only); both the first SOFA score and the maximal SOFA score are considered in separate models. The recommended cut-point of 2 is used for calculating the evaluation metrics for qSOFA.
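The influence-curve confidence intervals of LeDell et al. [73] are available in R (the cvAUC package); the Python sketch below pools out-of-fold predictions from 10-fold cross-validation on synthetic data and uses a simple bootstrap interval instead, which approximates but does not reproduce the cited method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))                          # stand-in features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Pool out-of-fold predicted probabilities from 10-fold cross-validation.
oof = np.zeros(len(y))
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for tr, te in cv.split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    oof[te] = model.predict_proba(X[te])[:, 1]

auc = roc_auc_score(y, oof)
# Bootstrap a 95% CI (a simpler alternative to the influence-curve approach).
boot = [roc_auc_score(y[i], oof[i])
        for i in (rng.integers(0, len(y), len(y)) for _ in range(200))]
lo, hi = np.percentile(boot, [2.5, 97.5])
```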


To examine performance in predicting events in the relatively remote future, we considered two validation sets. In the first validation set, time slices are created forward in time ("clockwise"): the first time slice starts at the admission timestamp. With three time slices of data, we predict in-ICU mortality (1) within 12 hours after the third time slice and (2) within 24 hours after the third time slice. We call this dataset "Validation Set 1". The second validation set is created with the rolled-back method; here, however, we aim to predict mortality after discharge from the ICU instead of in-ICU mortality. Again, two predictions are generated: within-12-hour and within-24-hour mortality. Each validation set comprises patients who died within the designated time frame plus a randomly selected 10% of survivors. For each prediction on a validation set, the DBN is trained on the training set (constructed with the rolled-back method described above) with validation patients excluded by patient ID and ICU-stay ID.
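One way to read the rolled-back construction is that time-slice boundaries are counted backward from an anchor event (death or discharge) rather than forward from admission. A hypothetical sketch follows; the 4-hour slice width is taken from the caption of Table 5, and the exact anchoring convention is an assumption.

```python
from datetime import datetime, timedelta

def rolled_back_slices(anchor, n_slices=3, width_hours=4):
    """Return (start, end) boundaries of n_slices windows counted
    backward from an anchor time such as ICU discharge or death."""
    edges = [anchor - timedelta(hours=width_hours * k)
             for k in range(n_slices, -1, -1)]
    return list(zip(edges[:-1], edges[1:]))

# Three 4-hour windows ending exactly at the anchor time.
slices = rolled_back_slices(datetime(2024, 1, 1, 12, 0))
```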

MEWS is constructed with a modification to the neurological component. The original component scores 0 for "alert", 1 for "reacting to voice", 2 for "reacting to pain", and 3 for "unresponsive". In our construction, GCS is used with arbitrarily chosen cut-offs that mimic the relationship between GCS and head injury severity [74, 75]: neurological score = 0 if 14 ≤ GCS ≤ 15, 1 if 10 ≤ GCS < 14, 2 if 6 ≤ GCS < 10, and 3 if GCS < 6. As with SOFA, the mortality prediction is obtained by regressing the ICU death indicator on MEWS (main effect only).
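The GCS mapping can be written directly. Since the stated ranges meet at the boundary values 14, 10, and 6, this sketch resolves boundaries toward the less severe score, which is one reading of the text rather than necessarily the authors' exact implementation:

```python
def mews_neuro_from_gcs(gcs):
    """Map a Glasgow Coma Scale value to the modified MEWS neurological score."""
    if gcs >= 14:
        return 0   # GCS 14-15: roughly "alert"
    if gcs >= 10:
        return 1   # roughly "reacting to voice"
    if gcs >= 6:
        return 2   # roughly "reacting to pain"
    return 3       # GCS < 6: roughly "unresponsive"
```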

For the Simplified Acute Physiology Score II (SAPS-II), the predicted probability of death is obtained from the formula suggested by Le Gall et al. [19]:
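For reference, the logistic conversion published in [19] is logit = -7.7631 + 0.0737 * SAPS + 0.9971 * ln(SAPS + 1), with predicted mortality equal to the inverse-logit of that quantity. A direct transcription follows; the coefficients should be verified against the original paper.

```python
import math

def saps2_mortality(score):
    """Predicted hospital mortality from a SAPS-II score, per Le Gall et al. [19]."""
    logit = -7.7631 + 0.0737 * score + 0.9971 * math.log(score + 1)
    return 1.0 / (1.0 + math.exp(-logit))
```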


ROC curves and boxplots of the predicted probabilities of death for survivors and non-survivors are displayed to assist assessment of discrimination capability (Fig 3 and Fig 4).

Summary reclassification measures of the continuous Net Reclassification Index (cNRI) and the Integrated Discrimination Improvement (IDI) will be calculated with SOFA or SAPS-II scoring classifier as the initial model and DBN as the updated classifier. Positive values of the cNRI and IDI suggest that the updated classifier has better discriminative ability than the initial classifier, whereas negative values suggest the opposite[76, 77, 78].
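Both summary measures reduce to simple averages over the two models' predicted probabilities. A minimal sketch follows; in it, ties in predicted risk count toward neither direction for the cNRI, a convention that should be checked against [76, 77, 78].

```python
import numpy as np

def cnri_idi(p_old, p_new, y):
    """Continuous NRI and IDI comparing an updated model to an initial one."""
    p_old, p_new = np.asarray(p_old, float), np.asarray(p_new, float)
    y = np.asarray(y, bool)
    up, down = p_new > p_old, p_new < p_old
    # cNRI: net proportion of events moved up plus non-events moved down.
    cnri = (up[y].mean() - down[y].mean()) + (down[~y].mean() - up[~y].mean())
    # IDI: improvement in mean risk separation between events and non-events.
    idi = ((p_new[y].mean() - p_old[y].mean())
           - (p_new[~y].mean() - p_old[~y].mean()))
    return cnri, idi

# Toy check: the updated model moves every prediction in the right direction.
cnri, idi = cnri_idi([0.4, 0.4, 0.4, 0.4], [0.8, 0.7, 0.1, 0.2], [1, 1, 0, 0])
```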


Study Data

24,506 ICU stays for 19,623 adult patients with suspected or confirmed infection are included in the analysis, out of a total of 29,431 ICU stays for 23,701 patients. Demographics and baseline characteristics of patients are shown in Table 1; 2,829 (14.4%) patients were deceased at discharge. Non-survivors were older (median age 73.7 years, IQR: 59.5 - 82.9) than survivors (median 64.4 years, IQR: 50.7 - 76.9) and had a longer ICU length of stay (LOS): median 3.9 days (IQR: 1.8 - 8.5) versus 2.3 days (IQR: 1.3 - 4.7). Table 2 shows the descriptive statistics of vital signs and lab measurements. All statistical analyses were done with the open-source statistical computing software R (version 3.3.1) and associated risk prediction packages (PredictABEL [79], pROC [80], and caret [81]).

Characteristics, N (%)             Survivor           Non-survivor       Total
Total                              16794 (85.6%)      2829 (14.4%)       19623
Male, n (%)                        9516 (56.7)        1519 (53.7)        11035 (56.3)
Race, n (%)
  White                            11910 (70.9)       1960 (69.3)        13870 (70.7)
  Black                            1282 (7.6)         165 (5.8)          1447 (7.4)
  Hispanic                         536 (3.2)          45 (1.6)           581 (3.0)
  Asian                            366 (2.2)          57 (2.0)           423 (2.2)
  Other                            2700 (16.1)        602 (21.3)         3302 (16.8)
Admission Type, n (%)
  Elective                         3262 (16.5)        112 (4.0)          3374 (14.9)
  Emergent                         15818 (80.0)       2587 (91.6)        18405 (81.5)
  Urgent                           689 (3.5)          126 (4.5)          815 (3.6)
ICU Type, n (%)
  Coronary Care Unit               3910 (19.8)        620 (21.9)         4530 (20.0)
  Cardiac Surgery Recovery         7001 (35.4)        638 (22.6)         7639 (33.8)
  Combined Medical/Surgical ICU    2150 (10.9)        320 (11.3)         2470 (10.9)
  Medical ICU                      5640 (28.5)        1137 (40.2)        6777 (30.0)
  Surgical ICU                     1068 (5.4)         110 (3.9)          1178 (5.2)
Pressor usage, n (%)               4533 (21.3)        1573 (49.0)        6106 (24.9)
Antibiotics usage, n (%)           10780 (52.4)       1863 (58.0)        13305 (56.0)
Age, median (IQR)                  64.4 (50.7, 76.9)  73.7 (59.5, 82.9)  65.8 (51.9, 78.0)
LOS, median (IQR)                  2.3 (1.3, 4.7)     3.9 (1.8, 8.5)     2.5 (1.3, 5.0)

Table 1: Demographic and ICU encounter characteristics by mortality at discharge.

IQR = Inter-quartile range, LOS = Length of Stay.

Tables 2 and 3 below show the measured physiological variables and vital signs at the time slice closest to discharge, and the comparative performance of ICU mortality predictions using SOFA, qSOFA, SAPS-II, MEWS, and the DBN.

Variable                    Survivor (Q1, Q3)      Non-survivor (Q1, Q3)   Total (Q1, Q3)
Diastolic Blood Pressure    59.3 (51.5, 68.3)      47.7 (36.7, 58.3)       58.3 (50.2, 67.5)
Systolic Blood Pressure     122.25 (110.0, 136.8)  96.5 (74.6, 117.7)      120.5 (107.3, 135.6)
Temperature                 36.8 (36.4, 37.2)      36.8 (36.1, 37.6)       36.8 (36.4, 37.2)
PaCO2                       41 (37, 46)            39 (32, 48)             41 (36, 46)
Pulse                       82.3 (72.6, 92.5)      86.8 (71.8, 104.0)      82.7 (72.5, 93.5)
Respiratory Rate            19.27 (16.8, 22.3)     19.5 (14.4, 24.3)       19.3 (16.5, 22.5)
Creatinine                  0.9 (0.7, 1.2)         1.6 (0.9, 2.8)          0.9 (0.7, 1.3)
WBC                         10.1 (7.6, 13.1)       13.9 (8.7, 20.3)        10.2 (7.7, 13.5)
FiO2                        0.41 (0.4, 0.5)        0.5 (0.4, 0.8)          0.5 (0.4, 0.6)
PaO2                        100.0 (82.0, 126.0)    100.3 (76.0, 146.0)     100.0 (80.0, 130.4)
SpO2                        97 (95.6, 98.3)        93.8 (84.8, 97.4)       97 (95.3, 98.3)
Lactate                     1.3 (1, 1.8)           5.65 (2.72, 11.3)       1.9 (1.2, 5.3)
GCS                         15 (15, 15)            6.5 (3, 11)             15 (14, 15)
Hemoglobin                  10.4 (9.4, 11.5)       10 (9.0, 11.2)          10.3 (9.4, 11.4)
Bicarbonate                 26 (23, 28)            21 (16, 26)             25 (23, 28)
Mean Arterial Pressure      81.25 (72.3, 92.5)     63 (50.0, 76.1)         78.5 (68.8, 90.5)
INR                         10.4 (9.4, 11.45)      10.1 (9.01, 11.2)       10.3 (9.4, 11.4)
Platelet count              1.3 (1, 1.8)           5.7 (2.5, 11.8)         2.07 (1.3, 6.2)
ALT                         2.9 (2.5, 3.3)         2.5 (2.1, 3)            2.8 (2.4, 3.3)
AST                         42 (21, 98)            82 (25, 285)            46 (21, 127)
Bilirubin                   26 (23, 28)            22 (17, 26.5)           25 (23, 28)

Table 2: Physiological measurements at the time slice closest to discharge: median (IQR).

IQR = Inter-quartile range

Metric        qSOFA               SOFA First          SOFA Max            SAPS-II             MEWS*               DBN
AUC           0.66 (0.637-0.683)  0.756 (0.73-0.781)  0.853 (0.816-0.871) 0.766 (0.735-0.798) 0.729 (0.686-0.771) 0.913 (0.906-0.919)
Sensitivity   0.846 (0.239-1.0)   0.571 (0.317-0.825) 0.665 (0.544-0.787) 0.759 (0.647-0.871) 0.518 (0.339-0.696) 0.825 (0.802-0.849)
Specificity   0.288 (0-0.874)     0.767 (0.51-1.023)  0.818 (0.707-0.929) 0.63 (0.536-0.724)  0.775 (0.608-0.943) 0.874 (0.84-0.908)
PPV           0.104 (0.076-0.133) 0.248 (0.085-0.412) 0.301 (0.21-0.392)  0.163 (0.131-0.195) 0.19 (0.104-0.277)  0.474 (0.416-0.533)
NPV           0.975 (0.931-1.018) 0.942 (0.923-0.961) 0.956 (0.942-0.97)  0.966 (0.955-0.977) 0.945 (0.934-0.955) 0.973 (0.971-0.976)
F1            0.187 (0.146-0.228) 0.326 (0.224-0.428) 0.41 (0.341-0.478)  0.268 (0.224-0.312) 0.272 (0.199-0.344) 0.602 (0.56-0.643)

Table 3: Comparison of performance of scores in mortality prediction among patients with infection.

*Scoring of the neurological variable in MEWS is based on GCS: neurological score = 0 if 14 ≤ GCS ≤ 15, 1 if 10 ≤ GCS < 14, 2 if 6 ≤ GCS < 10, and 3 if GCS < 6.

Fig 3 plots ROC curves for all scoring methods. The AUROC is 0.66 (95% CI: 0.637 - 0.683) for qSOFA, 0.756 (95% CI: 0.73 - 0.781) for the first SOFA score, 0.843 (95% CI: 0.816 - 0.871) for the maximum SOFA score, 0.729 (95% CI: 0.686 - 0.771) for MEWS, and 0.766 (95% CI: 0.735 - 0.798) for SAPS-II. A substantial improvement in performance is observed with the newly built DBN model, which yields an AUROC of 0.91 (95% CI: 0.888 - 0.933). Table 4 indicates strong predictive capability for both in-ICU and after-discharge mortality, especially for times relatively close to the last time point used for training. As expected, performance declines at more remote horizons; e.g., for after-discharge mortality, AUC decreased from 0.968 (95% CI: 0.958 - 0.978) within 12 hours to 0.866 (95% CI: 0.839 - 0.893) within 24 hours. Cox calibration analysis suggests that the DBN prediction is well calibrated (see S1 Appendix for more details).

Fig 3: Receiver-operating characteristic curves of all scoring methods.

Time       Metric        In-ICU Mortality       After-discharge Mortality
12 hours   AUC           0.985 (0.981 - 0.989)  0.968 (0.958 - 0.978)
           Sensitivity   0.940 (0.899 - 0.969)  1 (1 - 1)
           Specificity   0.942 (0.923 - 0.977)  0.941 (0.937 - 0.957)
24 hours   AUC           0.975 (0.967 - 0.982)  0.866 (0.839 - 0.893)
           Sensitivity   0.923 (0.882 - 0.952)  0.721 (0.655 - 0.819)
           Specificity   0.940 (0.918 - 0.977)  0.889 (0.783 - 0.912)

Table 4: Performance of online inference.

We tested whether parameter estimates learned with additional slices of data could improve the DBN's prediction performance and found that three slices of data are sufficient to achieve the best performance (Table 5); no significant change is seen when a fourth slice is added. Table 6 indicates that the DBN significantly improves over the SOFA, qSOFA, MEWS, and SAPS-II scores, since all mean NRI and IDI values are positive and no 95% confidence interval includes zero. Distributions of the predicted probabilities of death are plotted in Fig 4 by survivorship status. For the first SOFA, MEWS, and SAPS-II, large proportions of false positives and false negatives are observed regardless of the choice of cut-off point. The distributions of predicted probabilities of death are similar for the maximum SOFA and the DBN, but the DBN has a reduced proportion of false-positive predictions.

Metric        2 slices               3 slices               4 slices
AUC           0.856 (0.826 - 0.885)  0.91 (0.888 - 0.933)   0.901 (0.888 - 0.915)
Sensitivity   0.749 (0.668 - 0.829)  0.814 (0.773 - 0.856)  0.814 (0.746 - 0.881)
Specificity   0.839 (0.801 - 0.877)  0.880 (0.851 - 0.909)  0.886 (0.843 - 0.928)
PPV           0.376 (0.340 - 0.412)  0.492 (0.436 - 0.549)  0.483 (0.407 - 0.559)
NPV           0.963 (0.953 - 0.973)  0.971 (0.965 - 0.977)  0.974 (0.967 - 0.981)
F measure     0.500 (0.470 - 0.530)  0.613 (0.571 - 0.655)  0.605 (0.558 - 0.651)

Table 5: Effect of adding slices: network with 4-hour rolled-back.


Table 6: Reclassification statistics with respect to DBN as the updated classifier: mean (95% CI).
Initial classifier Continuous NRI IDI
First SOFA 1.23 (1.20 - 1.34) 0.48 (0.47 - 0.50)
Maximal SOFA 1.20 (1.17 - 1.24) 0.41 (0.39 - 0.42)
SAPS-II 1.09 (1.05 – 1.12) 0.34 (0.33 - 0.36)
MEWS 1.23 (1.20 – 1.26) 0.52 (0.51 – 0.54)
qSOFA 1.24 (1.21 – 1.28) 0.55 (0.54 – 0.57)

NRI = Net Reclassification Index; IDI = Integrated Discrimination Improvement

Fig 4: Distribution of the predicted probability of death in the non-survivors and survivors.


We developed a DBN-based algorithm for scoring the probability of death of ICU patients using 'near event' data. The endpoint of in-hospital mortality has become the standard metric for early warning systems assessing risk for yet-to-be-identified sepsis patients in the ICU. ICU mortality risk prediction tools such as SOFA, SAPS-II, and MEWS have been available for some time but have been of limited assistance in the management of individual patients. With the Sepsis-3 standards published by the Sepsis Definitions Task Force [13] for identifying patients at risk of this potentially deadly disease, SOFA has taken on new importance in the ICU as a basis for decision support tools designed to expedite sepsis treatments and improve outcomes. The new definition calls for detecting a change in SOFA score of ≥ 2 points from baseline to identify at-risk patients suspected of having an infection, and it is the basis of many alerting tools. Given the highly nonlinear, chaotic nature of sepsis, we believe that a more specific, sophisticated machine learning method using time-series data can more accurately and precisely assess individualized patient risk. Our DBN model achieves this and, by making accurate predictions hours before events, may contribute to efforts by bedside and discharge clinicians to reduce sepsis mortality.

Three aspects may explain the improved performance of the DBN model.

First, compared to the static parametric models commonly used for machine learning in medicine (e.g., logistic regression), the nonlinear, temporally dynamic relationships between variables in a complex system, such as the human body responding to an infection in a dysregulated manner [13], may be more adequately captured by a highly interconnected DBN network structure that explicitly models the temporal dynamics of features.

Second, beyond the overall structure of the DBN, the ontological models used in this study provided semantic characterizations of raw EMR data that were then used as training features. For example, the ontology characterized recorded events such as administration of a specific medication (e.g., cefazolin) in a patient with a related event "surgery" and no other evidence of infection as a "prophylactic antibiotic", thereby distinguishing infected from non-infected patients. Similarly, semantic characterizations of events such as abnormal vitals immediately following surgery, or abnormal labs otherwise "explained by" medications (e.g., elevated INR following anticoagulant use) or chronic conditions, distinguished sepsis-induced acute organ dysfunction from chronic conditions. We believe these reductions of "ambiguous feature noise" improved recognition of valid physiological data signatures associated with mortality outcomes.

Third, unlike the variable transformations (assigning scores based on thresholds) used by the SOFA, MEWS, and SAPS scoring approaches, continuous versions of the variables are used in DBN parameter learning. Threshold-based transformation may lose information, as indicated by the results of an experiment by Pirracchio et al. [49]: as a side experiment, they compared the performance of the original SAPS-II with a logistic model using all untransformed SAPS-II predictors as independent variables, and found the logistic model's predictions to be better.
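The information loss from thresholding can be illustrated on synthetic data: dichotomizing a continuous predictor at a single cut-off (as score-based tools do per component) typically lowers discrimination. This is a toy sketch, not a re-run of the Pirracchio et al. experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)                                    # a continuous variable
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-2 * x))).astype(int)

tr, te = np.arange(n) < n // 2, np.arange(n) >= n // 2    # holdout split

def holdout_auc(features):
    """Fit on the first half, score discrimination on the second half."""
    model = LogisticRegression().fit(features[tr], y[tr])
    return roc_auc_score(y[te], model.predict_proba(features[te])[:, 1])

auc_raw = holdout_auc(x.reshape(-1, 1))                      # continuous predictor
auc_cut = holdout_auc((x > 0).astype(float).reshape(-1, 1))  # thresholded predictor
```

On this data the continuous predictor discriminates noticeably better than its dichotomized version, mirroring the information-loss argument above.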

Another advantage of the study is that all ICU stays are included in the analysis. Measurements from multiple ICU stays of a single patient may be strongly correlated, and such correlation may affect model performance if not taken into consideration [82]; this is one limitation identified by the Super ICU Learner Algorithm (SICULA) project [49]. However, the performance of the DBN suggests the impact is limited: compared to the SICULA study, in which only a single stay from each patient was included, the DBN trained on all ICU stays still outperforms SICULA.

It should be noted that models such as SOFA were not designed to be implemented by machines/EHRs; rather, they are bedside tools intended for use by clinicians. When making evaluations, clinicians consider additional factors for each individual patient, for example, recognizing that an INR may be elevated because of warfarin use rather than because of sepsis and DIC. Such factors are not typically accounted for by the existing scoring tools.

SOFA criteria for organ dysfunction associated with sepsis are defined as an "acute change in total SOFA score ≥ 2 points consequent to infection", which requires evaluation of two key concepts that cannot be directly clinically measured: "acute change" and "consequent to". Clinicians can apply complex reasoning and experience to contextual knowledge of a patient's clinical history, risk factors, and current treatments to adjudicate whether a SOFA score increase of ≥ 2 points in a time series of physiological data is in fact an "acute change" from some patient "baseline", and whether that change is "consequent to" an infection, leading to a diagnosis of sepsis as opposed to a myriad of other possible non-infectious explanations for the change. This reasoning may draw on a blend of evidence-based rules as well as years of clinical experience with similar change patterns in infected patients. DBN models can capture these temporal dynamics over large populations, enhancing rule-based tools with experiential knowledge derived by learning over large datasets.
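A machine implementation of the "≥ 2-point acute change" criterion forces a concrete choice of baseline. The hypothetical sketch below defaults to the first recorded score, which is one of several defensible conventions rather than anything Sepsis-3 prescribes:

```python
def acute_sofa_change(sofa_series, baseline=None, delta=2):
    """Return the index of the first time the total SOFA score rises at
    least `delta` points above baseline, or None if it never does.

    `baseline` defaults to the first recorded score; Sepsis-3 leaves the
    choice of baseline to clinical judgment, so this is an assumption."""
    if baseline is None:
        baseline = sofa_series[0]
    for t, score in enumerate(sofa_series):
        if score - baseline >= delta:
            return t
    return None
```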

The clinical utility of a high precision ‘near event‘ predictive model is unknown and will require evaluation in pilot studies. As suggested, a tool such as the derived DBN may be useful to discharge clinicians in identifying patients near ICU discharge that are more prone to readmission or death and inform enhanced transfer decisions. Another limitation of the study is that we used SOFA and SAPS-II as reference scores while updated versions are available. This is partly due to a limitation of MIMIC-II. Some of the predictors included in the updated version of scores, e.g. SAPS-III and APACHE-III, are not directly available in the MIMIC-II database. However, it has been noted that the updated versions of scores are associated with similar drawbacks as the old version[23, 83].

Another drawback of MIMIC-II is that all samples come from a single hospital, which may limit generalizability due to limited case heterogeneity. However, our Cmap-driven DBN concept embraces iterative intra-institutional continuous improvement (retraining) of the model as new evidence, experience, and data dictate, and calibration/re-validation in new institutional settings. Cmaps can be revised and refined by clinicians using web-based tools to iteratively improve semantic model performance in interpreting patient data, with consequential benefits to subsequent ML. Additionally, as noted above, data-driven predictive models work best in the populations from which they were derived. The reuse of ontological structures, combined with the availability of powerful arrays of cloud-based cluster computing resources, may soon enable "near real-time machine learning for re-calibration/validation" using new patient encounter datasets as they occur.

In summary, we constructed a DBN based on clinician-constructed Cmaps and found that its ability to predict mortality outperforms the more conventional SOFA, qSOFA, MEWS, and SAPS-II scores and the more recently proposed SICULA model. In future work, we plan to confirm the utility of our highly predictive 'near event' model as a decision support tool and to validate the algorithm on external data for improved case heterogeneity. While additional validation work remains, the promising DBN algorithm could be valuable as an alternative or addition to conventional scores in clinical settings.


  •  1. Thiel SW, Rosini JM, Shannon W, Doherty JA, Micek ST, Kollef MH. Early prediction of septic shock in hospitalized patients. Journal of hospital medicine. 2010;5(1):19–25.
  •  2. Liu V, Escobar GJ, Greene JD, Soule J, Whippy A, Angus DC, et al. Hospital deaths in patients with sepsis from 2 independent cohorts. Jama. 2014;312(1):90–92.
  •  3. Henry KE, Hager DN, Pronovost PJ, Saria S. A targeted real-time early warning score (TREWScore) for septic shock. Science Translational Medicine. 2015;7(299):299ra122–299ra122.
  •  4. Cretikos M, Chen J, Hillman K, Bellomo R, Finfer S, Flabouris A, et al. The objective medical emergency team activation criteria: a case–control study. Resuscitation. 2007;73(1):62–72.
  •  5. Churpek MM, Yuen TC, Park SY, Gibbons R, Edelson DP. Using electronic health record data to develop and validate a prediction model for adverse outcomes on the wards. Critical care medicine. 2014;42(4):841.
  •  6. Moss TJ, Lake DE, Calland JF, Enfield KB, Delos JB, Fairchild KD, et al. Signatures of subacute potentially catastrophic illness in the ICU: model development and validation. Critical care medicine. 2016;44(9):1639–1648.
  •  7. Angus DC, Van Der Poll T. Severe sepsis and septic shock. New England Journal of Medicine. 2013;369(9):840–851.
  •  8. Zuev SM, Kingsmore SF, Gessler DD. Sepsis progression and outcome: a dynamical model. Theor Biol Med Model. 2006;3:8.
  •  9. Kumar A, Roberts D, Wood KE, Light B, Parrillo JE, Sharma S, et al. Duration of hypotension before initiation of effective antimicrobial therapy is the critical determinant of survival in human septic shock. Critical care medicine. 2006;34(6):1589–1596.
  •  10. Han YY, Carcillo JA, Dragotta MA, Bills DM, Watson RS, Westerman ME, et al. Early reversal of pediatric-neonatal septic shock by community physicians is associated with improved outcome. Pediatrics. 2003;112(4):793–799.
  •  11. Buist MD, Moore GE, Bernard SA, Waxman BP, Anderson JN, Nguyen TV. Effects of a medical emergency team on reduction of incidence of and mortality from unexpected cardiac arrests in hospital: preliminary study. Bmj. 2002;324(7334):387–390.
  •  12. Mao Y, Chen W, Chen Y, Lu C, Kollef M, Bailey T. An integrated data mining approach to real-time clinical monitoring and deterioration warning. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM; 2012. p. 1140–1148.
  •  13. Singer M, Deutschman CS, Seymour CW, Shankar-Hari M, Annane D, Bauer M, et al. The third international consensus definitions for sepsis and septic shock (sepsis-3). Jama. 2016;315(8):801–810.
  •  14. Seymour CW, Liu VX, Iwashyna TJ, Brunkhorst FM, Rea TD, Scherag A, et al. Assessment of clinical criteria for sepsis: for the Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3). Jama. 2016;315(8):762–774.
  •  15. Freund Y, Lemachatti N, Krastinova E, Van Laer M, Claessens YE, Avondo A, et al. Prognostic accuracy of sepsis-3 criteria for in-hospital mortality among patients with suspected infection presenting to the emergency department. Jama. 2017;317(3):301–308.
  •  16. Besen BAMP, Romano TG, Nassar AP, Taniguchi LU, Azevedo LCP, Mendes PV, et al. Sepsis-3 definitions predict ICU mortality in a low–middle-income country. Annals of intensive care. 2016;6(1):107.
  •  17. Raith EP, Udy AA, Bailey M, McGloughlin S, MacIsaac C, Bellomo R, et al. Prognostic accuracy of the SOFA score, SIRS criteria, and qSOFA score for in-hospital mortality among adults with suspected infection admitted to the intensive care unit. Jama. 2017;317(3):290–300.
  •  18. Vincent JL, Moreno R, Takala J, Willatts S, De Mendonça A, Bruining H, et al. The SOFA (Sepsis-related Organ Failure Assessment) score to describe organ dysfunction/failure. Intensive care medicine. 1996;22(7):707–710.
  •  19. Le Gall JR, Lemeshow S, Saulnier F. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. Jama. 1993;270(24):2957–2963.
  •  20. Knaus WA, Wagner DP, Draper EA, Zimmerman JE, Bergner M, Bastos PG, et al. The APACHE III prognostic system: risk prediction of hospital mortality for critically III hospitalized adults. Chest. 1991;100(6):1619–1636.
  •  21. Lemeshow S, Teres D, Klar J, Avrunin JS, Gehlbach SH, Rapoport J. Mortality Probability Models (MPM II) based on an international cohort of intensive care unit patients. Jama. 1993;270(20):2478–2486.
  •  22. Nassar AP, Mocelin AO, Nunes ALB, Giannini FP, Brauer L, Andrade FM, et al. Caution when using prognostic models: a prospective comparison of 3 recent prognostic models. Journal of critical care. 2012;27(4):423–e1.
  •  23. Poole D, Rossi C, Latronico N, Rossi G, Finazzi S, Bertolini G, et al. Comparison between SAPS II and SAPS 3 in predicting hospital mortality in a cohort of 103 Italian ICUs. Is new always better? Intensive care medicine. 2012;38(8):1280–1288.
  •  24. Kuzniewicz MW, Vasilevskis EE, Lane R, Dean ML, Trivedi NG, Rennie DJ, et al. Variation in ICU risk-adjusted mortality: impact of methods of assessment and potential confounders. CHEST Journal. 2008;133(6):1319–1327.
  •  25. Gardner-Thorpe J, Love N, Wrightson J, Walsh S, Keeling N. The value of Modified Early Warning Score (MEWS) in surgical in-patients: a prospective observational study. The Annals of The Royal College of Surgeons of England. 2006;88(6):571–575.
  •  26. Reini K, Fredrikson M, Oscarsson A. The prognostic value of the Modified Early Warning Score in critically ill patients: a prospective, observational study. European Journal of Anaesthesiology (EJA). 2012;29(3):152–157.
  •  27. Desautels T, Calvert J, Hoffman J, Jay M, Kerem Y, Shieh L, et al. Prediction of sepsis in the intensive care unit with minimal electronic health record data: a machine learning approach. JMIR medical informatics. 2016;4(3).
  •  28. Mani S, Ozdas A, Aliferis C, Varol HA, Chen Q, Carnevale R, et al. Medical decision support using machine learning for early detection of late-onset neonatal sepsis. Journal of the American Medical Informatics Association. 2014;21(2):326–336.
  •  29. Horng S, Sontag DA, Halpern Y, Jernite Y, Shapiro NI, Nathanson LA. Creating an automated trigger for sepsis clinical decision support at emergency department triage using machine learning. PloS one. 2017;12(4):e0174708.
  •  30. Taylor RA, Pare JR, Venkatesh AK, Mowafi H, Melnick ER, Fleischman W, et al. Prediction of In-hospital Mortality in Emergency Department Patients With Sepsis: A Local Big Data–Driven, Machine Learning Approach. Academic Emergency Medicine. 2016;23(3):269–278.
  •  31. Goldstein BA, Navar AM, Pencina MJ, Ioannidis J. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. Journal of the American Medical Informatics Association. 2017;24(1):198–208.
  •  32. Danner OK, Hendren S, Santiago E, Nye B, Abraham P. Physiologically-based, predictive analytics using the heart-rate-to-Systolic-Ratio significantly improves the timeliness and accuracy of sepsis prediction compared to SIRS. The American Journal of Surgery. 2017;213(4):617–621.
  •  33. Schlapbach LJ, MacLaren G, Festa M, Alexander J, Erickson S, Beca J, et al. Prediction of pediatric sepsis mortality within 1 h of intensive care admission. Intensive Care Medicine. 2017; p. 1–12.
  •  34. Haskins IN, Maluso PJ, Amdur R, Agarwal S, Sarani B. Predictors of Mortality after Emergency General Surgery: An NSQIP Risk Calculator. Journal of the American College of Surgeons. 2016;223(4):S58–S59.
  •  35. Houthooft R, Ruyssinck J, van der Herten J, Stijven S, Couckuyt I, Gadeyne B, et al. Predictive modelling of survival and length of stay in critically ill patients using sequential organ failure scores. Artificial intelligence in medicine. 2015;63(3):191–207.
  •  36. Mann-Salinas LEA, Engebretson J, Batchinsky AI. A complex systems view of sepsis: implications for nursing. Dimensions of Critical Care Nursing. 2013;32(1):12–17.
  •  37. Kilic YA, Kilic I, Tez M. Sepsis and multiple organ failure represent a chaotic adaptation to severe stress which must be controlled at nanoscale. Critical Care. 2009;13(6):424.
  •  38. Saliba S, Kilic YA, Uranues S. Chaotic nature of sepsis and multiple organ failure cannot be explained by linear statistical methods. Critical Care. 2008;12(2):417.
  •  39. Tang CH, Middleton PM, Savkin AV, Chan GS, Bishop S, Lovell NH. Non-invasive classification of severe sepsis and systemic inflammatory response syndrome using a nonlinear support vector machine: a preliminary study. Physiological measurement. 2010;31(6):775.
  •  40. Quinn JA, Williams CK, McIntosh N. Factorial switching linear dynamical systems applied to physiological condition monitoring. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009;31(9):1537–1551.
  •  41. Saeed M, Villarroel M, Reisner AT, Clifford G, Lehman LW, Moody G, et al. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical care medicine. 2011;39(5):952.
  •  42. Lee CH, Arzeno NM, Ho JC, Vikalo H, Ghosh J. An imputation-enhanced algorithm for ICU mortality prediction. In: Computing in Cardiology (CinC), 2012. IEEE; 2012. p. 253–256.
  •  43. Xia H, Daley BJ, Petrie A, Zhao X. A neural network model for mortality prediction in ICU. In: Computing in Cardiology (CinC), 2012. IEEE; 2012. p. 261–264.
  •  44. Johnson AE, Dunkley N, Mayaud L, Tsanas A, Kramer AA, Clifford GD. Patient specific predictions in the intensive care unit using a Bayesian ensemble. In: Computing in Cardiology (CinC), 2012. IEEE; 2012. p. 249–252.
  •  45. McMillan S, Chia CC, Van Esbroeck A, Rubinfeld I, Syed Z. ICU mortality prediction using time series motifs. In: Computing in Cardiology (CinC), 2012. IEEE; 2012. p. 265–268.
  •  46. Nachimuthu SK, Haug PJ. Early detection of sepsis in the emergency department using Dynamic Bayesian Networks. In: AMIA Annual Symposium Proceedings. vol. 2012. American Medical Informatics Association; 2012. p. 653.
  •  47. Sandri M, Berchialla P, Baldi I, Gregori D, De Blasi RA. Dynamic Bayesian Networks to predict sequences of organ failures in patients admitted to ICU. Journal of biomedical informatics. 2014;48:106–113.
  •  48. Peelen L, de Keizer NF, de Jonge E, Bosman RJ, Abu-Hanna A, Peek N. Using hierarchical dynamic Bayesian networks to investigate dynamics of organ failure in patients in the Intensive Care Unit. Journal of biomedical informatics. 2010;43(2):273–286.
  •  49. Pirracchio R, Petersen ML, Carone M, Rigon MR, Chevret S, van der Laan MJ. Mortality prediction in intensive care units with the Super ICU Learner Algorithm (SICULA): a population-based study. The Lancet Respiratory Medicine. 2015;3(1):42–52.
  •  50. Bloos F, Reinhart K. Rapid diagnosis of sepsis. Virulence. 2014;5(1):154–160.
  •  51. Srinivasan L, Harris MC. New technologies for the rapid diagnosis of neonatal sepsis. Current opinion in pediatrics. 2012;24(2):165–171.
  •  52. Novak JD, Musonda D. A twelve-year longitudinal study of science concept learning. American Educational Research Journal. 1991;28(1):117–153.
  •  53. Dellinger RP, Levy MM, Rhodes A, Annane D, Gerlach H, Opal SM, et al. Surviving Sepsis Campaign: international guidelines for management of severe sepsis and septic shock, 2012. Intensive care medicine. 2013;39(2):165–228.
  •  54. Hoehndorf R, Schofield PN, Gkoutos GV. The role of ontologies in biological and biomedical research: a functional perspective. Briefings in bioinformatics. 2015;16(6):1069–1080.
  •  55. Goossen WT. Detailed clinical models: representing knowledge, data and semantics in healthcare information technology. Healthcare informatics research. 2014;20(3):163–172.
  •  56. Brewer A, Helfgott MA, Novak J, Schanhals R. an application of Cmaps in the description of Clinical Information Structure and Logic in Electronic Health Records. Global Advances in Health and Medicine. 2012;1(4):16–31.
  •  57. Kappen TH, Vergouwe Y, van Klei WA, van Wolfswinkel L, Kalkman CJ, Moons KG. Adaptation of clinical prediction models for application in local settings. Medical Decision Making. 2012;32(3):E1–E10.
  •  58. Sepanski RJ, Godambe SA, Mangum CD, Bovat CS, Zaritsky AL, Shah SH. Designing a pediatric severe sepsis screening tool. Frontiers in pediatrics. 2014;2.
  •  59. Harris ZS. Distributional structure. Word. 1954;10(2-3):146–162.
  •  60. Apostolova E, Velez T. Toward Automated Early Sepsis Alerting: Identifying Infection Patients from Nursing Notes. BioNLP 2017. 2017; p. 257–262.
  •  61. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:13013781. 2013;.
  •  62. Chapman WW, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of biomedical informatics. 2001;34(5):301–310.
  •  63. Shavdia D. Septic shock: Providing early warnings through multivariate logistic regression models. Massachusetts Institute of Technology; 2007.
  •  64. Krawczyk B. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence. 2016;5(4):221–232.
  •  65. Box GE, Cox DR. An analysis of transformations. Journal of the Royal Statistical Society Series B (Methodological). 1964; p. 211–252.
  •  66. Koller D, Friedman N. Probabilistic graphical models: principles and techniques. MIT press; 2009.
  •  67. Lin JH, Haug PJ. Exploiting missing clinical data in Bayesian network modeling for predicting medical problems. Journal of biomedical informatics. 2008;41(1):1–14.
  •  68. Almeida E, Ferreira P, Vinhoza TT, Dutra I, Borges P, Wu Y, et al. Expert Bayes: Automatically Refining Manually Built Bayesian Networks. In: Machine Learning and Applications (ICMLA), 2014 13th International Conference on. IEEE; 2014. p. 362–366.
  •  69. Dagum P, Luby M. Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial intelligence. 1993;60(1):141–153.
  •  70. Chickering DM, Heckerman D, Meek C. Large-sample learning of Bayesian networks is NP-hard. Journal of Machine Learning Research. 2004;5(Oct):1287–1330.
  •  71. Murphy KP. Dynamic bayesian networks. Probabilistic Graphical Models, M Jordan. 2002;7.
  •  72. Naeini MP, Cooper GF. Binary Classifier Calibration Using an Ensemble of Linear Trend Estimation. In: Proceedings of the 2016 SIAM International Conference on Data Mining. SIAM; 2016. p. 261–269.
  •  73. LeDell E, Petersen M, van der Laan M. Computationally efficient confidence intervals for cross-validated area under the ROC curve estimates. Electronic journal of statistics. 2015;9(1):1583.
  •  74. Iankova A. The Glasgow Coma Scale: clinical application in emergency departments. Emerg Nurse. 2006;14(8):30–35.
  •  75. Joseph B, Pandit V, Aziz H, Kulvatunyou N, Zangbar B, Green DJ, et al. Mild traumatic brain injury defined by Glasgow Coma Scale: Is it really mild? Brain Inj. 2015;29(1):11–16.
  •  76. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation. 2007;115(7):928–935.
  •  77. Cook NR. Statistical evaluation of prognostic versus diagnostic models: beyond the ROC curve. Clinical chemistry. 2008;54(1):17–23.
  •  78. Pencina MJ, D’Agostino RB, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Statistics in medicine. 2008;27(2):157–172.
  •  79. Kundu S, Aulchenko YS, van Duijn CM, Janssens ACJ. PredictABEL: an R package for the assessment of risk prediction models. European journal of epidemiology. 2011;26(4):261.
  •  80. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez JC, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC bioinformatics. 2011;12(1):77.
  •  81. Kuhn M. Caret package. Journal of Statistical Software. 2008;28(5):1–26.
  •  82. Verbeke G, Fieuws S, Molenberghs G, Davidian M. The analysis of multivariate longitudinal data: A review. Statistical methods in medical research. 2014;23(1):42–59.
  •  83. Sakr Y, Krauss C, Amaral AC, Réa-Neto A, Specht M, Reinhart K, et al. Comparison of the performance of SAPS II, SAPS 3, APACHE II, and their customized prognostic models in a surgical intensive care unit. British journal of anaesthesia. 2008;101(6):798–803.

Appendix 1. Calibration plots

The Cox calibration method is applied to assess the calibration of each score. Denote by $p$ the predicted probability of mortality from each scoring algorithm; Cox calibration then regresses the observed log-odds of mortality on the predicted log-odds:

$$\operatorname{logit}\big(\Pr(\text{death})\big) = \alpha + \beta\,\operatorname{logit}(p),$$

where $\alpha$ and $\beta$ are the intercept and slope coefficients. Under perfect calibration between the observed and predicted probabilities, $\alpha = 0$ and $\beta = 1$. The estimates $\hat{\alpha}$ and $\hat{\beta}$ obtained from the data are therefore tested against the null hypothesis $H_0\colon \alpha = 0,\ \beta = 1$ using a U-statistic, to quantify the degree of deviation from ideal calibration.

Among all scoring methods, only SAPS-II shows severe deviation from ideal calibration, with estimates $\hat{\alpha}$ and $\hat{\beta}$ far from the null values. For the DBN, the estimated $\hat{\alpha}$ and $\hat{\beta}$ are close to the null values ($\alpha = 0$, $\beta = 1$).
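As an illustrative sketch only (not the authors' implementation), the Cox calibration regression above can be fit by a standard logistic regression of the binary mortality outcome on the logit of the predicted probability, here via Newton–Raphson (IRLS) in NumPy. The function name `cox_calibration` and the synthetic data are assumptions for demonstration; a perfectly calibrated score should recover an intercept near 0 and a slope near 1.

```python
import numpy as np

def logit(p):
    """Log-odds transform of a probability array."""
    return np.log(p / (1.0 - p))

def cox_calibration(y, p_pred, n_iter=25):
    """Fit logit(Pr(y=1)) = alpha + beta * logit(p_pred) by Newton-Raphson (IRLS).

    Returns (alpha_hat, beta_hat); ideal calibration gives alpha=0, beta=1.
    """
    X = np.column_stack([np.ones_like(p_pred), logit(p_pred)])
    w = np.zeros(2)  # start at alpha=0, beta=0
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ w)))   # fitted probabilities
        W = mu * (1.0 - mu)                   # IRLS working weights
        grad = X.T @ (y - mu)                 # score vector
        hess = X.T @ (X * W[:, None])         # observed information
        w = w + np.linalg.solve(hess, grad)   # Newton step
    return w[0], w[1]

# Synthetic check: outcomes drawn exactly from the predicted probabilities,
# so the fitted intercept and slope should land near the null values (0, 1).
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=20000)
y = (rng.uniform(size=p.size) < p).astype(float)
alpha_hat, beta_hat = cox_calibration(y, p)
```

A miscalibrated score (for example, one that systematically overestimates risk) would instead yield an intercept or slope deviating from (0, 1), which is what the U-statistic test in the appendix formalizes.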

Fig 5: Calibration plot for SOFA (first score).
Fig 6: Calibration plot for SOFA (maximum score).
Fig 7: Calibration plot for qSOFA.
Fig 8: Calibration plot for MEWS.
Fig 9: Calibration plot for SAPS-II.
Fig 10: Calibration plot for DBN.