The accumulation of healthcare data has reached unprecedented levels: NHS datasets alone record billions of patient interactions every year [nhs-report]. In particular, the close monitoring of patients in an Intensive Care Unit (ICU) generates a wealth of data for each patient [Johnson2016], with some information recorded every minute. Much of this information remains underexploited, increasing the pressure on caregivers. Finding efficient ways to use the data from Electronic Health Records (EHRs) is therefore crucial to reduce the burden on hospitals and improve patient outcomes.
In a typical setting, an EHR contains two types of information: structured (e.g., blood tests, temperature, lab results) and unstructured (e.g., nursing notes, radiology reports, discharge summaries), with the latter making up the largest part (typically up to 80% [Kong2019]). Both types are valuable for ICU monitoring. The majority of recent research [harutyunyan2019multitask, Subudhi2021, Alves2019] nevertheless relies on the more straightforward structured information, typically laboratory test results and vital signs. Unstructured information, on the other hand, remains underexplored, and typically one of two approaches is applied: either raw clinical text is fed into a classifier for end-to-end prediction [Krishnan2018, Grnarova2016, Marafino2018, Yu2020], or categorical features are first extracted from the text and then used for prediction. These categorical features can be pre-extracted medical concepts: e.g., diseases [Cooley2021, Ye2020], symptoms [Dreisbach2019], and medications [Chen2020, Kim2020].
Among the unstructured data, the phenotype (in medical text, the word “phenotype” refers to deviations from normal morphology, physiology, or behaviour [robinson2012deep]) has received the least attention for ICU monitoring [Cooley2021, Ye2020]. This is mainly due to the challenge of extracting phenotypic information, which is expressed through a variety of contextual synonyms: for example, the phenotype Hypotension can appear in text as “low blood pressure”. Yet phenotypes are crucial for understanding disease diagnosis, identifying important disease-specific information, stratifying patients, and identifying novel disease subtypes [aerts2006gene, deisseroth2019clinphen, Liu2019, son2018deep, Xu2020].
A range of clinical Natural Language Processing (NLP) approaches exists that can extract phenotypic information from clinical text in an unsupervised manner [arbabi2019ncr, Tchechmedjiev2018, deisseroth2019clinphen]. This information can thus be exploited automatically, for example, to predict mortality in septic patients [Cooley2021].
Our work thoroughly investigates the value of phenotypic information extracted from text for ICU monitoring. We automatically extract mentions of phenotypes from clinical text using a self-supervised methodology built on recent advances in clinical NLP, namely contextualised word embeddings [alsentzer-etal-2019-publicly], which are particularly helpful for detecting contextual synonyms. We extract these mentions for over 15,000 terms of the Human Phenotype Ontology (HPO) [kohler2017human]. We enrich the phenotypic features extracted in this manner with information from the structured data (i.e., bedside measurements and laboratory test results) and use both types of features in a series of non-neural and neural classification algorithms. To interpret our results we use Shapley values [Lundberg2017original], a methodology from game theory that ranks the inputs of a system according to their importance for each individual prediction. With its help, we are able to provide highly relevant explanations for a complete picture of a patient’s status at both the individual and cohort levels.
We benchmark our approach on the following three mainstream ICU tasks [harutyunyan2019multitask]: length of stay, in-hospital mortality and physiological decompensation. Length of stay is one of the major drivers of operating costs and resource allocation in hospitals; predicting it in advance can be highly useful for decision-making and resource planning. In-hospital mortality prediction is important for the early identification of patients at risk and has long been a research focus, as evidenced by the multiple scoring systems designed for the task (e.g., APACHE [Knaus1991apache], SAPS [LeGall1993saps]). Knowing which patients are more vulnerable and prone to decompensation (the physiological decompensation task) allows clinicians to monitor and track these patients more closely. Many early-warning scores have been developed for this task as well (e.g., MEWS [Gardner2006mews], NEWS [Smith2019news]).
Our main contributions are: (i) an approach to incorporate phenotypic features into the modelling of ICU time-series prediction tasks; (ii) an investigation of the importance of phenotypic features, in combination with structured information, for predicting the patient course at the micro (individual patient) and macro (cohort) levels; (iii) a thorough interpretability study demonstrating the importance of phenotypic features and structured information in ICU cases; (iv) a demonstration of the utility of automatic phenotyping for ICU use cases.
For our experiments, we use the publicly available ICU data from the MIMIC-III [Johnson2016mimic] database. Following [harutyunyan2019multitask], we formulate the in-hospital mortality problem as a binary classification at 48 hours after admission, in which the label indicates whether the patient dies before discharge. We formulate the physiological decompensation problem as a binary classification, in which the target label corresponds to whether the patient will die in the next 24 hours. We cast the length of stay (LOS) prediction task as a multi-class classification problem, where the labels correspond to the remaining length of stay. Possible values are divided into 10 bins: one for stays of less than a day, seven bins for each day of the first week, another bin for stays of more than one week but less than two, and a final bin for stays of more than two weeks.
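As an illustration, this 10-bin labelling can be sketched with a hypothetical helper (the bin boundaries follow the description above):

```python
def los_bin(remaining_los_days: float) -> int:
    """Map a remaining length of stay (in days) to one of 10 class labels."""
    if remaining_los_days < 1:
        return 0  # stays of less than one day
    if remaining_los_days < 8:
        # one bin per day of the first week: [1, 2) -> 1, ..., [7, 8) -> 7
        return int(remaining_los_days)
    if remaining_los_days < 14:
        return 8  # more than a week but less than two
    return 9      # more than two weeks
```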
For data extraction, we closely follow [harutyunyan2019multitask] (with one difference: ICU stays without available clinical notes are excluded): from 61,533 ICU stays in MIMIC-III across 46,521 patients, we exclude hospital admissions with more than one ICU stay or with transfers between ICU wards. Patients younger than 18 are excluded as well. All data associated with the remaining ICU stays is then processed as follows: numerical values are mapped to a unified metric system and clinical notes are annotated with HPO phenotypes. We use our contextualised approach for phenotype extraction, as well as two other (self-)supervised annotators: the Neural Concept Recognizer (NCR) [arbabi2019ncr] and ClinicalBERT [alsentzer-etal-2019-publicly]. These phenotypes are propagated forward across timesteps according to a persistency rule created manually by an expert clinician: phenotypes that would typically last an entire ICU stay (persistent) are propagated until its end, while those that could resolve earlier (transient) are propagated only until the appearance of the next clinical note. 15% of all patients are reserved for testing, while the remaining 85% are further divided for 4-fold cross-validation. All splits are fixed and shared across all tasks and classification settings.
For in-hospital mortality, stays for which the LOS is shorter than 48 hours or unknown are further excluded. The number of patients, ICU episodes and timesteps per task is reported in Table 1. The mortality rate across all patients is 13.12% and the decompensation rate across all timesteps is 2.01%. The distribution of ICU stays per LOS class is presented in Table 2, and the number of unique phenotypes annotated for the cohort in Table 3. We notice that our phenotyping model detects the highest number of phenotypes in two out of three tasks, given its optimisation towards recall. Overall, we extract on average around 1,146 unique phenotypes for in-hospital mortality (651 for ClinicalBERT, 1,381 for NCR, 1,405 for our phenotyping model), 1,207 for physiological decompensation (674 for ClinicalBERT, 1,471 for NCR, 1,475 for our phenotyping model), and 1,202 for length of stay (668 for ClinicalBERT, 1,471 for NCR, 1,460 for our phenotyping model); around 30% of them are persistent (on average across tasks). The most frequent phenotypes are abnormal blood oxygen level, functional respiratory abnormality, arrhythmia, abnormality of temperature regulation, abnormal cardiovascular system physiology, and increased blood pressure.
More details on the pre-processing procedure are found in the Methods section.
| Task | Fold | Patients | ICU Episodes | Timesteps | LOS Class Distribution |
| --- | --- | --- | --- | --- | --- |
| Length of Stay | CV-1 | 5151 | 6245 | 532403 | Refer to Table 2 |
| Class Label | Class Description (Days) | CV-1 | CV-2 | CV-3 | CV-4 | Test |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 - 2 | 85311 | 83558 | 84065 | 85818 | 61372 |
| 2 | 2 - 3 | 56353 | 54074 | 54007 | 54780 | 38858 |
| 3 | 3 - 4 | 39416 | 37605 | 38106 | 38054 | 27142 |
| 4 | 4 - 5 | 29384 | 27982 | 28760 | 28573 | 20171 |
| 5 | 5 - 6 | 22830 | 22384 | 22360 | 22626 | 15878 |
| 6 | 6 - 7 | 18816 | 18612 | 18626 | 18582 | 12940 |
| 7 | 7 - 8 | 15925 | 15583 | 15697 | 15863 | 10953 |
| 8 | 8 - 14 | 62655 | 58512 | 59905 | 60611 | 40856 |
| Tasks | Annotators | No. of Unique Phenotypes |
| --- | --- | --- |
| Length of Stay | ClinicalBERT | 668 |
In general, we investigate the performance of three classifiers: Random Forest [Breiman2001] (RF), Gradient Boosting [Chen2016xgboost] (GB), and the Long Short-Term Memory network [Hochreiter1997lstm] (LSTM). For each of them we investigate the following sets of features: structured only (S), comprising 17 features such as blood pressure and fraction of inspired oxygen along with their imputation masks (discussed in the Data preprocessing section), and structured enriched with phenotypic features coming from one of the three annotators.
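As an illustration of these feature sets, a minimal sketch of how a per-timestep feature matrix could be assembled (array shapes and names are hypothetical, not the exact pipeline):

```python
import numpy as np

def build_features(structured, masks, active_phenotypes, n_phenotypes):
    """Concatenate structured features, their imputation masks, and a
    binary indicator per extracted phenotype, for every hourly timestep.

    structured, masks: (n_timesteps, n_structured) arrays;
    active_phenotypes: per-timestep iterables of phenotype indices."""
    n_timesteps = structured.shape[0]
    pheno = np.zeros((n_timesteps, n_phenotypes))
    for t, active in enumerate(active_phenotypes):
        for p in active:
            pheno[t, p] = 1.0  # phenotype p is present at timestep t
    return np.concatenate([structured, masks, pheno], axis=1)
```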
Overall, the results show that phenotypic information positively complements the structured information, improving performance on all tasks (on average by 2.16% for in-hospital mortality, 0.53% for decompensation, and 0.83% for LOS across annotators and classifiers). The improvements with our phenotyping model are statistically significant across tasks, with the exception of physiological decompensation with RF. Improvements from NCR are not significant for physiological decompensation with RF and GB, nor for LOS with RF. ClinicalBERT did not produce significant improvements for in-hospital mortality with RF and LSTM, physiological decompensation with RF, or LOS with RF and GB. For LOS prediction in particular, the improvement is explained by the fact that phenotypic information brings into the picture highly valuable signals such as response to therapy, development of complications and unmeasured indicators of illness severity, none of which are captured by structured information or by severity scores based on day-1 forecasts, and all of which are fundamental to correctly estimating LOS [Kramer2017los]. For in-hospital mortality, the improvements are explained by the fact that phenotypic information captures the patient’s comorbidities, which are known to be essential for accurately predicting mortality risk [Forte2019].
LSTMs [Hochreiter1997lstm] consistently outperform the other algorithms due to their intrinsic ability to capture temporal relations, presenting average increases of 1.41%, 0.52% and 1.51% for in-hospital mortality, physiological decompensation and LOS, respectively. Across all classifiers, setups with phenotypic features extracted by our phenotyping model tend to outperform otherwise similar setups by 0.66% on in-hospital mortality, 0.42% on physiological decompensation and 2.67% on LOS.
Note that clinically validated scores such as APACHE-III and SAPS-II are not competitive across settings.
Across all three tasks there is no clear winner between NCR and ClinicalBERT: each can outperform the other without any clear tendency. We also find that the mean cross-validation scores are comparable to the respective test-set scores across all tasks and models.
While we conclude that phenotypic information indeed provides relevant signal for these tasks, decision support systems in the healthcare domain should also be reliable, interpretable and robust. Therefore, we accompany the above results with a thorough interpretability study, providing explanations at both the patient and cohort levels for the observed predictions, and an assessment of robustness by studying performance across disease-specific subcohorts.
| Classifier | Features | CV AUC-ROC (mean) | CV AUC-ROC (std) | CV AUC-PR (mean) | CV AUC-PR (std) | Test AUC-ROC (95% CI) | Test AUC-PR (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RF | S | 0.809 | 0.008 | 0.417 | 0.018 | 0.800 (0.775, 0.824) | 0.339 (0.286, 0.395) |
| RF | S + NCR | 0.818 | 0.014 | 0.472 | 0.013 | 0.828 (0.802, 0.853) | 0.467 (0.404, 0.529) |
| RF | S + CB | 0.803 | 0.011 | 0.423 | 0.004 | 0.812 (0.787, 0.838) | 0.403 (0.345, 0.463) |
| RF | S + Ours | 0.829 | 0.008 | 0.458 | 0.017 | 0.833 (0.809, 0.857) | 0.396 (0.340, 0.456) |
| GB | S | 0.785 | 0.046 | 0.419 | 0.024 | 0.793 (0.766, 0.820) | 0.376 (0.318, 0.434) |
| GB | S + NCR | 0.811 | 0.015 | 0.433 | 0.025 | 0.830 (0.805, 0.855) | 0.448 (0.386, 0.510) |
| GB | S + CB | 0.805 | 0.014 | 0.392 | 0.023 | 0.817 (0.791, 0.842) | 0.380 (0.324, 0.441) |
| GB | S + Ours | 0.805 | 0.005 | 0.394 | 0.014 | 0.819 (0.794, 0.843) | 0.377 (0.321, 0.437) |
| LSTM | S | 0.828 | 0.007 | 0.441 | 0.015 | 0.826 (0.801, 0.848) | 0.391 (0.334, 0.452) |
| LSTM | S + NCR | 0.835 | 0.010 | 0.478 | 0.008 | 0.841 (0.818, 0.864) | 0.453 (0.393, 0.513) |
| LSTM | S + CB | 0.828 | 0.007 | 0.459 | 0.006 | 0.826 (0.802, 0.849) | 0.414 (0.355, 0.476) |
| LSTM | S + Ours | 0.844 | 0.004 | 0.495 | 0.013 | 0.845 (0.823, 0.868) | 0.464 (0.405, 0.523) |
ICU In-Hospital Mortality Results with Area Under the Curve of Receiver Operating Characteristic (AUC-ROC) and Area Under the Curve of Precision-Recall (AUC-PR) Scores. Test set scores are shown with 95% confidence intervals. The best score for each classifier is highlighted in bold. Here, S refers to Structured, NCR to Neural Concept Recognizer[arbabi2019ncr], CB to ClinicalBERT, and Ours to our phenotyping model.
| Classifier | Features | CV AUC-ROC (mean) | CV AUC-ROC (std) | CV AUC-PR (mean) | CV AUC-PR (std) | Test AUC-ROC (95% CI) | Test AUC-PR (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RF | S | 0.815 | 0.003 | 0.127 | 0.009 | 0.826 (0.821, 0.831) | 0.130 (0.123, 0.138) |
| RF | S + NCR | 0.820 | 0.003 | 0.124 | 0.007 | 0.825 (0.821, 0.830) | 0.124 (0.118, 0.131) |
| RF | S + CB | 0.818 | 0.004 | 0.122 | 0.008 | 0.826 (0.821, 0.830) | 0.125 (0.118, 0.132) |
| RF | S + Ours | 0.824 | 0.002 | 0.125 | 0.006 | 0.827 (0.823, 0.832) | 0.126 (0.119, 0.134) |
| GB | S | 0.812 | 0.004 | 0.134 | 0.010 | 0.830 (0.825, 0.835) | 0.154 (0.146, 0.163) |
| GB | S + NCR | 0.817 | 0.001 | 0.126 | 0.018 | 0.831 (0.827, 0.836) | 0.124 (0.118, 0.131) |
| GB | S + CB | 0.816 | 0.004 | 0.122 | 0.023 | 0.834 (0.829, 0.838) | 0.157 (0.149, 0.165) |
| GB | S + Ours | 0.823 | 0.005 | 0.131 | 0.014 | 0.838 (0.833, 0.842) | 0.162 (0.154, 0.171) |
| LSTM | S | 0.819 | 0.002 | 0.135 | 0.016 | 0.824 (0.819, 0.829) | 0.126 (0.119, 0.133) |
| LSTM | S + NCR | 0.820 | 0.003 | 0.133 | 0.012 | 0.834 (0.829, 0.839) | 0.134 (0.127, 0.142) |
| LSTM | S + CB | 0.820 | 0.005 | 0.128 | 0.021 | 0.833 (0.828, 0.838) | 0.114 (0.108, 0.119) |
| LSTM | S + Ours | 0.833 | 0.007 | 0.144 | 0.022 | 0.839 (0.834, 0.844) | 0.145 (0.138, 0.153) |

ICU Physiological Decompensation Results with AUC-ROC and AUC-PR scores. Test set scores are shown with 95% confidence intervals.
| Classifier | Features | CV Kappa (mean) | CV Kappa (std) | CV MAD (mean) | CV MAD (std) | Test Kappa (95% CI) | Test MAD (95% CI) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| RF | S | 0.381 | 0.004 | 142.0 | 4.6 | 0.390 (0.388, 0.392) | 136.8 (136.2, 137.4) |
| RF | S + NCR | 0.382 | 0.007 | 148.0 | 4.1 | 0.390 (0.388, 0.392) | 142.5 (141.9, 143.1) |
| RF | S + CB | 0.369 | 0.005 | 149.2 | 3.7 | 0.376 (0.374, 0.379) | 144.3 (143.7, 144.9) |
| RF | S + Ours | 0.395 | 0.006 | 146.2 | 4.6 | 0.405 (0.403, 0.407) | 140.6 (140.0, 141.2) |
| GB | S | 0.387 | 0.001 | 155.9 | 2.1 | 0.399 (0.397, 0.401) | 151.8 (151.1, 152.4) |
| GB | S + NCR | 0.366 | 0.005 | 150.2 | 3.0 | 0.381 (0.378, 0.383) | 146.7 (146.1, 147.3) |
| GB | S + CB | 0.373 | 0.005 | 148.8 | 2.7 | 0.392 (0.390, 0.394) | 143.5 (143.0, 144.1) |
| GB | S + Ours | 0.401 | 0.003 | 142.0 | 3.7 | 0.412 (0.410, 0.414) | 138.9 (138.3, 139.5) |
| LSTM | S | 0.375 | 0.003 | 134.3 | 17.2 | 0.380 (0.377, 0.382) | 157.0 (156.3, 157.6) |
| LSTM | S + NCR | 0.392 | 0.013 | 127.1 | 17.4 | 0.406 (0.404, 0.408) | 123.3 (122.8, 123.9) |
| LSTM | S + CB | 0.374 | 0.015 | 127.6 | 8.6 | 0.388 (0.386, 0.390) | 120.1 (119.6, 120.6) |
| LSTM | S + Ours | 0.416 | 0.011 | 116.1 | 6.9 | 0.430 (0.427, 0.432) | 116.7 (116.1, 117.2) |

ICU Length of Stay Results with Cohen's Kappa and Mean Absolute Deviation (MAD) scores. Test set scores are shown with 95% confidence intervals.
In this paper we work on three popular tasks in the ICU setting. Our work shows that these tasks benefit from the inclusion of phenotypic information found in clinical notes, irrespective of the classification algorithm. Further, we show that our algorithm outperforms state-of-the-art models. In the following section, we illustrate how phenotypic information also improves reliability and provides clinicians with valuable explanations for the predictions.
During modelling we found it beneficial to separate extracted phenotypes into persistent and transient, and then to propagate them forward in time. Each phenotype was marked by a human clinical expert based on whether it would typically persist throughout an entire ICU stay. Accordingly, transient phenotypes (e.g., fever, cough, dyspnea) are propagated until the appearance of a new clinical note, and persistent ones (e.g., diabetes, cancer) until the end of the ICU stay. To investigate the value of this procedure more closely, we performed an ablation study assessing the contribution of our phenotype propagation methodology. The results of these experiments are in Table 7.
They show that phenotype propagation is particularly beneficial for models such as RF and GB. Since the intrinsic properties of these models do not allow for temporal modelling, the propagation of phenotypic information is essential: without it, performance is comparable to that of models trained purely on structured data.
Furthermore, the LSTM model for in-hospital mortality using phenotypes without propagation also performed worse than its counterpart with propagation, though better than the purely structured one. Considering the short 48-hour window within which information is collected, this seems to indicate that our knowledge-based propagation method is more useful than the temporal relationships the LSTM network can learn in such a reduced time span.
Finally, models for physiological decompensation and LOS without phenotypic propagation produced statistically superior results to those of the networks trained with such propagation. Since these two tasks exploit the entire time course of the ICU stay, LSTM models can better capture temporal relationships by themselves given the larger amount of data to learn from. While all of our transient phenotypes are propagated until the next clinical note appears, the LSTM can learn by itself per-phenotype rules, possibly propagating some phenotypes beyond the next clinical note, while discarding some others immediately after their appearance, for instance. We theorise this finer granularity for phenotypic propagation produces the observed improvements.
We conclude that these findings indicate that propagating phenotypic information according to our modelling is indeed a sound choice, and that further work focused on learning individual persistencies for phenotypes would be highly beneficial, not only to boost performance, but also to provide insights about the temporal duration of phenotypes in the ICU.
To further understand the contribution of phenotypic features to the prediction performance, we have studied the most important features with the help of Shapley values [Lundberg2017original]. This analysis, and all analyses involving Shapley values, are performed on the Gradient Boosting (GB) models. An illustration of our investigation is in Figure 1, where we present the top predictive features for in-hospital mortality and physiological decompensation. It confirms our earlier observation over automatic metrics that phenotypic features are particularly helpful for in-hospital mortality prediction, given that 14 of the 20 most important features are phenotypes. This aligns with our expectations and is explained by the fact that this task has no fixed point in the future at which to make predictions. In other words, since the prediction must cover the entire stay rather than, say, whether the patient will die within the next 12 hours, forecasts need to rely on data that is highly informative and able to provide accurate insights into the long-term future. Phenotypes, unlike bedside measurements, which can lose temporal correlation due to their dynamic nature, fulfil these criteria very well: they can capture, for instance, comorbidities, which are essential for predicting mortality [Forte2019] and constitute the entire input of clinically validated mortality models [Charlson1987, Elixhauser1998]. Furthermore, another study, covering 230,000 ICU patients, found that combining comorbidities with acute physiological measurements yielded the best results, outperforming all mortality scores (APACHE II, SAPS II) [Nielsen2019].
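For intuition, Shapley values can be computed exactly from their game-theoretic definition for a toy value function; production tools such as SHAP approximate this efficiently for real models. A minimal sketch (feature names are illustrative):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: features is a list of names; value_fn maps
    a frozenset of 'present' features to the model output."""
    n = len(features)
    phis = {}
    for f in features:
        others = [g for g in features if g != f]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # weight of a coalition of size k in the Shapley formula
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (value_fn(s | {f}) - value_fn(s))
        phis[f] = phi
    return phis
```

For an additive toy model, each feature's Shapley value recovers its individual contribution, and the values sum to the difference between the full and empty coalitions (the efficiency property).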
The top-ranking feature for mortality prediction is, unexpectedly, whether the patient experiences pain. Although not decisive, there is some initial evidence that pain management improves outcomes in the ICU [Georgiou2015, Hasegawa2017]. However, pain could also be interpreted as a proxy for a high level of consciousness, which has been correlated with better ICU outcomes [Bastos1993]. Other top-ranking phenotypes include tachycardia, atrial arrhythmia, constitutional symptom, nausea and vomiting, abnormal pupillary function, abnormal pattern of respiration, abnormal prothrombin time, confusion, increased inflammatory response, abnormal respiratory system morphology, diarrhea, and dyspnea, which together cover most of the body systems (i.e., heart, lungs, GI tract, central nervous system, coagulation, infection, kidneys) typically assessed by clinically validated scores, including early warning scores (e.g., NEWS, MEWS) and mortality scores (e.g., Charlson Comorbidity Index [Charlson1987], Elixhauser Index [Elixhauser1998], APACHE, SAPS).
Our study also showed that while phenotypic features are not as important for decompensation as for in-hospital mortality, they are still useful because they provide a better estimate of the predicted risk. Given that this task concerns predicting mortality within the next 24 hours, bedside measurements become more informative thanks to their temporal correlation; accordingly, only 4 of the top 20 features for this task are phenotypic. Nevertheless, bedside measurements can be ambiguous or paint an incomplete picture of the patient’s status without the data found in clinical notes, which can thus provide a baseline for the prediction. This situation is illustrated in Figure 2, where the top-ranking feature is a phenotype: even though it is marked as persistent and thus remains present throughout the ICU stay, it appropriately increases the estimated risk of decompensation. This is further confirmed by the calibration plot for decompensation (Figure 3(a)), which shows that the inclusion of phenotypic information leads to consistently better-calibrated estimates, and by Figure 2(a), where predictions for physiological decompensation benefit the most when the patient stays in the ICU for more than one week.
Similarly, the top features for length-of-stay are presented in Figure A2.
As a next step, we investigated the calibration curves to assess how phenotypic information contributes to the reliability of the models.
Our investigation of the respective calibration curves (for LSTMs, the results are in Figure 4) shows that unstructured information improves model calibration, i.e., the correlation of predicted with actual probabilities. To summarise each curve we use the Brier score [Brier1950], computed here as the mean squared error between the observed curve and the ideal one (i.e., the identity line).
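A minimal sketch of this per-curve summary, under our reading of the definition above (the binning choices are illustrative):

```python
import numpy as np

def calibration_score(y_true, y_prob, n_bins=10):
    """Bin predicted probabilities, compare each bin's observed event
    rate with its mean predicted probability, and average the squared
    deviations from the identity line."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    sq_errors = []
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # last bin is closed on the right so that y_prob == 1.0 is included
        mask = (y_prob >= lo) & ((y_prob <= hi) if i == n_bins - 1 else (y_prob < hi))
        if mask.any():
            sq_errors.append((y_true[mask].mean() - y_prob[mask].mean()) ** 2)
    return float(np.mean(sq_errors))
```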
Our major observation is that phenotypic features improve calibration across setups; this is especially true for the features extracted with our methodology for physiological decompensation and in-hospital mortality. LSTMs with our features are the settings producing the lowest Brier scores.
We have also noticed that physiological decompensation models are overall better calibrated, which is expected given that physiological decompensation has a near-future prediction scope (i.e., next 24 hours) compared to in-hospital mortality having no fixed time window.
Note that APACHE and SAPS (clinically validated mortality scores) are worse or only comparable to our weakest settings (i.e., only structured data).
Results for the other classifiers are presented in Figures A3 and A4. Overall, GBs exhibit similar calibration results for physiological decompensation and in-hospital mortality. Calibration with RFs for in-hospital mortality is also comparable. However, RFs for physiological decompensation present significantly worse calibrations.
Calibration curves for (a) physiological decompensation and (b) in-hospital mortality. Even when a system can accurately predict a label after thresholding, knowing the probability with which it assigned a class provides further insight and interpretation; calibration plots illustrate how interpretable such probabilities are. Each curve is presented with its Brier score (the lower the better). Note that, overall, the inclusion of unstructured data helps with calibration; further, for both tasks, unstructured data annotated by our phenotyping model provides the best results. LSTM: structured features with Long Short-Term Memory neural network. Ours, NCR, CB: phenotypic features from our phenotyping model, NCR and ClinicalBERT, respectively.
Beyond producing clinically relevant explanations at the cohort level, with the help of Shapley values we can shed light on a patient’s journey and discover retrospectively when the patient was most vulnerable and why. For example, the fragment of a patient’s LOS forecast in Figure 5 shows an estimated probability of 69%, at the 41st hour after admission, that the LOS will exceed 14 days, mainly because the patient scored 1 on the Glasgow Coma Scale Verbal Response. One hour later, when a clinical note becomes available, worrisome phenotypes appear (including edema, hypotension and abnormality of the respiratory system), and the estimated probability increases to 88%. While hypotension is a concept the classifier can learn from blood pressure measurements, edema and abnormality of the respiratory system are not reflected in typical acute measurements, all of which indicates the usefulness of phenotypic information for these tasks.
Finally, we measure the generalisability of the systems to different patient cohorts. We assess performance when testing on specific cohorts, such as Diabetes, Cancer, Cardiovascular Diseases, and Depression. We split our test set into these disease-specific cohorts and compute the scores using our best classifiers: LSTMs trained with both structured and unstructured data obtained by our phenotype extraction methodology. Results for all tasks are shown in Table 8. They show that our systems are robust across patient cohorts, including under-represented ones.
On the physiological decompensation task, all cohorts show similar performance, though all scores are lower than on the complete test set; the model thus appears to generalise well irrespective of the cohort. Similarly, on the in-hospital mortality task the model generalises to all cohorts, with Cardiovascular Diseases and Depression receiving the lowest scores.
By contrast, on the LOS task the model generalises well to the largest cohorts, i.e., Cardiovascular Diseases and Diabetes, but the smallest cohorts, i.e., Cancer and Depression, show poor generalisability. It seems that for physiological decompensation and in-hospital mortality the nature of the disease affects performance more than the sample size of the test set, whereas for LOS the performance lag of the smallest cohorts may be related to their sample size.
There are several directions in which to take our work further. We have investigated only one data source, and our observations remain to be confirmed on others.
Moreover, due to technical limitations, our explanations were produced with Gradient Boosting, which, while still superior to the baseline built with structured data only, is not our strongest model. Explanations from neural-network-based models are to be investigated in future studies.
Finally, there are certain limitations to our modelling approach. While we introduce a new modelling approach for phenotypic features, those annotated as transient are kept present only until a new clinical note appears in the timeline. This has the inconvenience that phenotypes might prematurely be considered absent simply because the next available clinical note does not mention them. Even though the LSTM classifier is able to learn temporal correlations on its own, more elaborate feature modelling could prove useful.
We extracted the data following the practices of the benchmark [harutyunyan2019multitask], using their filtering criteria for patients, admissions and ICU stays for all three tasks. For in-hospital mortality, stays for which the LOS is shorter than 48 hours or unknown are further excluded. In addition to the patient filtering criteria, we discard all ICU episodes in which no clinical note is recorded. This reduces our training and test data compared to the benchmark, so we recalculate the baseline scores using their code on our new test set. These scores are reported in the first row of the LSTM results in Tables 4, 5, and 6.
We follow the benchmark's preprocessing (https://github.com/YerevaNN/mimic3-benchmarks) for the structured data. We impute all missing values for the structured features with the most recent observation; when none is available, we use the normal reference value. We also create masks to differentiate imputed values from observed ones. We group multiple observations within an hourly timestep by taking the most recent value, and add missing timesteps with the same imputation strategy. Finally, we scale all features to zero mean and unit variance.
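A minimal sketch of this imputation strategy (column names and normal reference values are illustrative, not the benchmark's exact ones), assuming a pandas DataFrame indexed by observation time; scaling to zero mean and unit variance would follow as a separate step:

```python
import pandas as pd

# Hypothetical normal reference values used as a fallback
NORMAL_VALUES = {"heart_rate": 86.0, "temperature": 37.0}

def impute_hourly(df):
    """df: one ICU stay with a DatetimeIndex, one column per feature."""
    hourly = df.resample("1h").last()                    # most recent value per hour
    masks = hourly.notna().astype(int).add_suffix("_observed")
    imputed = hourly.ffill()                             # carry last observation forward
    imputed = imputed.fillna(pd.Series(NORMAL_VALUES))   # fall back to normal values
    return pd.concat([imputed, masks], axis=1)
```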
For the unstructured data, we face high sparsity, as clinical notes are recorded infrequently (roughly every 12 hours, against the much more frequently recorded structured data). We devise an imputation strategy to handle this issue. We divide the phenotypes into persistent and transient according to the annotations of an expert clinician. A phenotype clinically deemed likely to last an entire admission in the vast majority of typical cases is marked ‘persistent’; these are generally chronic conditions reflecting a patient’s medical background, such as cancer and tuberculosis, which are unlikely to change throughout the ICU stay. All temporary phenotypes, and those that generally last long but might resolve in some cases, are marked ‘transient’ to prevent a resolved phenotype from being carried forward too far; these can be acquired or can improve during an ICU stay, such as pain, fever and cough. In practice, this means that a persistent phenotype is made available throughout the entire stay from the point it first appears, while a transient phenotype is present from the moment it appears until a new clinical note appears. To reduce the overall feature dimensionality, we replace each phenotype with its immediate parent in the Human Phenotype Ontology (HPO) [Kohler2021hpo] tree (unless the parent is “Phenotypic abnormality”, the root node of phenotypes in HPO). Overall, this produces a 40-60% reduction in the number of features.
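The propagation rule can be sketched as follows (hourly integer timesteps and phenotype names are illustrative):

```python
def propagate_phenotypes(annotations, stay_end, persistent):
    """annotations: {note_timestep: set of phenotypes}; persistent: set of
    phenotype names marked persistent by the clinician. Returns
    {timestep: set of active phenotypes} for timesteps 0..stay_end-1."""
    active = {t: set() for t in range(stay_end)}
    note_times = sorted(annotations)
    for i, nt in enumerate(note_times):
        # a transient phenotype lives only until the next note (or stay end)
        next_note = note_times[i + 1] if i + 1 < len(note_times) else stay_end
        for p in annotations[nt]:
            end = stay_end if p in persistent else next_note
            for t in range(nt, end):
                active[t].add(p)
    return active
```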
Algorithm development and analysis
To extract phenotypes from free-text clinical notes, we use our phenotyping model, a BERT-based system trained first in a self-supervised fashion and then fine-tuned on manually annotated data, with a main focus on detecting contextual synonyms and increasing recall. We additionally use ClinicalBERT [alsentzer-etal-2019-publicly] and NCR [arbabi2019ncr] for comparison. NCR uses a convolutional neural network (CNN) to assign similarity scores to HPO concepts for phrases encoded with pre-trained non-contextualised word embeddings. Our phenotyping model, instead, uses a BERT-based network to assign similarity scores between contextualised word embeddings and pre-trained embeddings from a knowledge graph based on the HPO hierarchy. Our phenotyping model is further described in the Appendix.
Along with our phenotyping model, we use standard Machine Learning classifiers: Random Forest [Breiman2001], Gradient Boosting [Chen2016xgboost], and LSTM [Hochreiter1997lstm]. We do not use any explicit feature engineering with any of the models, besides distinguishing the phenotypic features into persistent and transient.
We use the Area Under the Receiver Operating Characteristic Curve [Lasko2005aucroc] (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) for the In-Hospital Mortality and Physiological Decompensation tasks. We rely primarily on AUC-ROC for statistical analysis as it is threshold-independent and used by the benchmark [harutyunyan2019multitask] as the primary metric. For the Length of Stay task, we use Cohen’s Kappa [Cohen1960kappa] and Mean Absolute Deviation [PhamGia2001mad] (MAD), relying primarily on the Kappa scores for statistical analysis.
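These metrics are available in scikit-learn; a toy sketch (the predictions are made up, `average_precision_score` is the usual AUC-PR surrogate, and the linear weighting of Kappa is an assumption about the benchmark's exact variant):

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             cohen_kappa_score)

# Toy probabilities for a binary task (mortality / decompensation style).
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
auc_roc = roc_auc_score(y_true, y_prob)
auc_pr = average_precision_score(y_true, y_prob)  # AUC-PR surrogate

# Toy length-of-stay bucket predictions (ordinal classes).
los_true = [0, 1, 2, 2, 3]
los_pred = [0, 1, 1, 2, 3]
kappa = cohen_kappa_score(los_true, los_pred, weights="linear")

# Mean Absolute Deviation between predicted and true LoS in hours.
mad = float(np.mean(np.abs(np.array([30.0, 50.0]) - np.array([28.0, 55.0]))))
```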
Explanation of predictions
To provide insights into our predictions we use Shapley values [strumbelj2014shapley], which are explained in more detail in the Appendix. These values are typically used to explain black-box models, and allow us to quantify the importance of a feature and whether it impacts the outcome positively or negatively.
Model evaluation and statistical analysis
We use a train-test split based on the benchmark [harutyunyan2019multitask], but exclude patients without clinical notes, resulting in 21,346 and 3,824 patients for the train and test sets, respectively. Further, we perform 4-fold cross-validation on the training set (5,344, 5,324, 5,357 and 5,321 patients per fold). All splits are deterministic, so that all classifiers with different data settings are trained and evaluated on the same subsets of data. Following the benchmark, we use bootstrap resampling for statistical analysis of the scores. To compute confidence intervals on the test set, we resample it 1,000 times for the length of stay and decompensation tasks and 10,000 times for the in-hospital mortality task, then compute the scores on the resampled data to calculate 95% confidence intervals.
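The bootstrap procedure can be sketched generically (the helper, seed and toy accuracy metric are illustrative; for class-sensitive metrics such as AUC-ROC one would additionally need to guard against resamples containing a single class):

```python
import numpy as np

def bootstrap_ci(y_true, y_prob, metric, n_resamples=1000, seed=0):
    """Resample the test set with replacement n_resamples times and
    return a percentile-based 95% confidence interval for a metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        scores.append(metric(y_true[idx], y_prob[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return float(lo), float(hi)

# Toy usage with a simple accuracy-style metric.
y = np.array([0, 1, 1, 0, 1, 0, 1, 1])
p = np.array([0, 1, 1, 0, 0, 0, 1, 1])
acc = lambda t, q: float(np.mean(t == q))
lo, hi = bootstrap_ci(y, p, acc, n_resamples=1000)
```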
The MIMIC-III database is available on PhysioNet repository [physionet2020mimiciii]. The code to generate the three ICU benchmarks from MIMIC-III is published by the study [harutyunyan2019multitask].
We would like to thank Dr. Rick Sax, Dr. Garima Gupta and Dr. Matt Wiener for their feedback throughout this research. We would also like to thank Dr. Garima Gupta, Dr. Deepa (M.R.S.H) and Dr. Ashok (M.S.) for helping us create gold-standard phenotype annotation data.
JZ, LB, AT and JI conceived the experiments. LB and AT conducted the experiments. JZ, LB, AT, AS and JI analysed the results. JI, VB and YG reviewed the research and manuscript. All authors approved the manuscript.
Competing interests: Vibhor Gupta and Yike Guo are founders of Pangaea Data Limited. Julia Ive is a consulting advisor to Pangaea Data Limited.
Our self-supervised contextualised phenotyping model
We represent tokens using contextualised embeddings from the state-of-the-art pre-trained ClinicalBERT [alsentzer-etal-2019-publicly]. Those embeddings are pre-trained on the MIMIC-III data on top of the BERT [devlin-etal-2019-bert] and BioBERT [biobert] embeddings, which are in turn pre-trained on text corpora in the general and biomedical domains, respectively. BERT is based on the Transformer architecture [vasvani:nips:2018], the current state of the art in language understanding and text generation, which allows modelling long-range contextual dependencies (dependencies between words separated from each other by long unrelated text spans). In addition to the adoption of the Transformer architecture, the novelty of BERT comes from its pre-training objective, so-called “masked language modelling” (MLM). This pre-training is fully unsupervised (or self-supervised) and consists of predicting masked words in text by relying on their left and right contexts.
As shown in Figure A1 (Left), our phenotyping model consists of a Transformer Encoder which is identical to, and initialised with, ClinicalBERT. On top of the Transformer Encoder, two additional classifiers predict whether a token is HPO-relevant and assign HPO concepts to the HPO-relevant tokens.
We enrich the model with the following three training objectives: (1) a standard binary cross-entropy loss objective to predict $p(y_t \mid x_t)$, where $x_t$ represents a token in text and $y_t = 1$ if $x_t$ is HPO-relevant, otherwise $y_t = 0$:
$$\mathcal{L}_1 = -\sum_t \big[ y_t \log p(y_t{=}1 \mid x_t) + (1 - y_t) \log p(y_t{=}0 \mid x_t) \big];$$
(2) a standard cross-entropy loss objective with softmax to predict $p(c_t \mid x_t)$, where the concept $c_t$ is defined over the 400 most frequent HPO concepts found in the training data:
$$\mathcal{L}_2 = -\sum_t \log p(c_t \mid x_t);$$
(3) the Euclidean distance between the contextualised embedding $h(x_t)$ of $x_t$ and the respective HPO concept embedding $e_{c_t}$ if $c_t$ is an infrequent HPO concept:
$$\mathcal{L}_3 = \sum_t \lVert h(x_t) - e_{c_t} \rVert_2.$$
The intuition behind $\mathcal{L}_2$ is to increase precision in the resulting performance, while the $\mathcal{L}_3$ loss objective is defined to increase recall of the resulting model and targets the detection of rare and granular HPO concepts. Good performance is reached by using both objectives; otherwise, for example, recall would suffer.
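The three training objectives described above can be sketched on toy tensors in numpy (all sizes, the random stand-in embeddings and the unweighted sum are illustrative assumptions, not the model's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
T, H, C = 6, 16, 400          # tokens, hidden size, frequent HPO concepts (toy)
h = rng.normal(size=(T, H))   # stand-in contextualised token embeddings

# (1) Binary cross-entropy: is the token HPO-relevant?
logits_rel = rng.normal(size=T)
y_rel = np.array([1, 0, 1, 0, 0, 1])
p_rel = 1.0 / (1.0 + np.exp(-logits_rel))
loss1 = -np.mean(y_rel * np.log(p_rel) + (1 - y_rel) * np.log(1 - p_rel))

# (2) Cross-entropy with softmax over the 400 most frequent HPO concepts.
logits_cls = rng.normal(size=(T, C))
y_cls = rng.integers(0, C, size=T)
logp = logits_cls - np.log(np.exp(logits_cls).sum(axis=1, keepdims=True))
loss2 = -np.mean(logp[np.arange(T), y_cls])

# (3) Euclidean distance between token embeddings and the pre-trained
# HPO concept embeddings, for tokens whose gold concept is infrequent.
e_hpo = rng.normal(size=(T, H))  # stand-in concept embeddings
loss3 = np.mean(np.linalg.norm(h - e_hpo, axis=1))

total = loss1 + loss2 + loss3    # equal weighting is an assumption
```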
Additionally, we use the masked language model (MLM) pre-training objective from BERT [devlin-etal-2019-bert], which helps to maintain the semantics of actual words in the created representations.
Prior to the training of the proposed model, we also build the knowledge graph embeddings for HPO concepts based on the Transformer architecture. The model is designed to encode both the hierarchical connections between HPO concepts and the semantics of the HPO concept definitions, so that similar HPO concepts have similar embeddings and dissimilar ones do not. We consider two learning objectives while training this model. The first learning objective encourages neighbouring HPO concepts to have similar embeddings and non-neighbouring HPO concepts to have different embeddings.
This objective takes the negative-sampling form
$$\mathcal{L}_{\text{graph}} = -\log \sigma(e_c \cdot e_{c^+}) - \sum_{c^- \in N(c)} \log \sigma(-e_c \cdot e_{c^-}),$$
where $e_c$ is the embedding of an HPO concept $c$, $c^+$ is a neighbouring HPO concept of $c$ and $N(c)$ is a set of non-neighbouring HPO concepts of $c$. Second, we adopt the skip-gram negative sampling model [mikolov:nips:2013] to predict the appearance of tokens in the HPO definitions.
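The neighbour-based objective can be sketched as a simple negative-sampling contrastive loss (the concept identifiers, random stand-in embeddings and exact loss form are illustrative assumptions; the trained model learns these embeddings rather than drawing them at random):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy stand-ins for learned HPO concept embeddings.
rng = np.random.default_rng(1)
e = {c: rng.normal(size=8) for c in ["HP:A", "HP:B", "HP:C", "HP:D"]}

def neighbour_loss(c, pos, negs):
    """Pull concept c towards a neighbouring concept `pos`, push it away
    from the sampled non-neighbours `negs` (negative sampling sketch)."""
    loss = -np.log(sigmoid(e[c] @ e[pos]))
    for n in negs:
        loss -= np.log(sigmoid(-(e[c] @ e[n])))
    return float(loss)

l = neighbour_loss("HP:A", "HP:B", ["HP:C", "HP:D"])
```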
In practice, this Transformer Encoder is also identical to, and initialised with, ClinicalBERT. The HPO embeddings are trained and obtained prior to the training of the proposed self-supervised model.
In the self-supervised pre-training stage, we use 38,772 brief-hospital-course notes from MIMIC-III discharge summaries and 1.5M notes produced by a data augmentation model (see details below), which is also trained on MIMIC-III. We build our training examples without any additional human annotation: HPO names, synonyms and abbreviations, as well as the respective Unified Medical Language System (UMLS) [Bodenreider2004] concepts provided by the ontology for each HPO concept, are matched exactly in the clinical text and used as training samples for our proposed self-supervised model.
In the fine-tuning stage, we collect gold-standard phenotype annotation data with the help of three expert clinicians and fine-tune the pre-trained phenotyping model on the gold data. The EHRs were pre-annotated with HPO concepts by keyword matching, and the annotations were then corrected by the three clinicians by consensus. We created our own sub-corpus of 242 discharge summaries from MIMIC-III with gold annotations, using 146, 48 and 48 of them for fine-tuning, validation and testing, respectively.
At the inference stage of phenotype annotation, we hypothesise that the classifiers can detect the HPO-relevant spans of frequent HPO concepts with high precision, while spans of rare HPO concepts can be found via the Euclidean distance between the contextualised token embedding and the HPO embedding with good recall. More precisely, we formalise the decision strategy as Algorithm 1.
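A per-token decision rule along these lines can be sketched as follows; this is an illustration of the frequent-classifier / rare-nearest-embedding split, not Algorithm 1 verbatim, and all thresholds, concept names and embeddings are hypothetical:

```python
import numpy as np

def annotate_token(p_relevant, cls_probs, frequent_concepts,
                   token_emb, hpo_embs, tau_rel=0.5, tau_cls=0.5, tau_dist=1.0):
    """Route a token either to the frequent-concept classifier (high
    precision) or to the nearest HPO embedding by Euclidean distance
    (better recall on rare concepts)."""
    if p_relevant < tau_rel:
        return None                      # token judged not HPO-relevant
    best = int(np.argmax(cls_probs))
    if cls_probs[best] >= tau_cls:
        return frequent_concepts[best]   # confident frequent-concept route
    # Rare-concept route: nearest concept embedding within a distance cap.
    names = list(hpo_embs)
    dists = [np.linalg.norm(token_emb - hpo_embs[n]) for n in names]
    i = int(np.argmin(dists))
    return names[i] if dists[i] <= tau_dist else None

# Toy inputs: two frequent concepts, two rare-concept embeddings.
frequent = ["HP:Fever", "HP:Cough"]
rare_embs = {"HP:Rare1": np.zeros(4), "HP:Rare2": np.ones(4)}
a = annotate_token(0.9, np.array([0.8, 0.2]), frequent, np.zeros(4), rare_embs)
b = annotate_token(0.9, np.array([0.4, 0.3]), frequent,
                   np.array([0.1, 0.0, 0.0, 0.0]), rare_embs)
c = annotate_token(0.1, np.array([0.8, 0.2]), frequent, np.zeros(4), rare_embs)
```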
Data Augmentation Techniques
In the self-supervised pre-training stage, there are two issues related to the creation of the training data by keyword matching: (1) this data can be too limited to help the model capture contextual synonyms of phenotypes; (2) rare HPO concepts may not be found in the clinical text used for training and the model will not be able to detect them at the inference stage. We address those two problems by data augmentation which creates textual variants for existing HPO-relevant spans by paraphrasing and uses a generative model to generate context around rare HPO concepts.
In the self-supervised setting, we report the performance of the following baselines: (1) Keyword, a naive method that simply matches HPO names, synonyms and abbreviations to text spans; (2) a variety of text mining models (Clinphen [deisseroth2019clinphen], NCBO [Tchechmedjiev2018], cTAKES [Savova2010], MetaMap [aronson2006metamap], MetaMapLite [demner2017metamaplite]); and (3) two deep learning models (NCR [arbabi2019ncr], MedCAT [kraljevic2019medcat]) which are trained without supervision.
In the fine-tuned setting, we use (1) BERT-Base [devlin-etal-2019-bert], (2) BioBERT-Base v1.0 [biobert] pre-trained on PubMed and PMC, (3) Clinical BERT [alsentzer-etal-2019-publicly] pre-trained based on BioBERT and MIMIC-III discharge summaries, (4) SciBERT [beltagy-etal-2019-scibert] pre-trained for scientific literature.
We use micro-averaged Precision, Recall and F1-score at the document level, with the HPO annotations de-duplicated. Following the best practices of [Liu2019], we report two metrics: (1) Exact match, where only the exact same HPO annotations are counted as correct; and (2) Generalised match, where both the predicted and target HPO annotations are first extended to include all their ancestors in HPO up to Phenotypic Abnormality (HP:0000118) (exclusive).
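The difference between the two matching schemes can be sketched on a toy ontology (the three-node parent map is purely illustrative, not real HPO structure; real evaluation would aggregate counts over all documents):

```python
ROOT = "HP:0000118"  # Phenotypic Abnormality, excluded from the extension
PARENTS = {"HP:leaf": ["HP:mid"], "HP:mid": [ROOT], ROOT: []}

def ancestors(c):
    """All ancestors of concept c, up to but excluding the root."""
    out, stack = set(), [c]
    while stack:
        node = stack.pop()
        for p in PARENTS.get(node, []):
            if p != ROOT and p not in out:
                out.add(p)
                stack.append(p)
    return out

def micro_prf(pred, gold, generalised=False):
    """Document-level precision/recall/F1 over de-duplicated annotations."""
    if generalised:
        pred = set().union(*({c} | ancestors(c) for c in pred)) if pred else set()
        gold = set().union(*({c} | ancestors(c) for c in gold)) if gold else set()
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Predicting the parent of the gold leaf: exact match gives no credit,
# generalised match partially credits the shared ancestor.
p_exact = micro_prf({"HP:mid"}, {"HP:leaf"})
p_gen = micro_prf({"HP:mid"}, {"HP:leaf"}, generalised=True)
```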
Our primary observation is that our phenotyping model outperforms all the baselines with respect to F1 and recall scores in both settings for both the exact and generalised matches.
Shapley values come from game theory and are used to estimate the impact of a feature on a system’s output. Feature impact is defined as the variation in the output of the model when the feature is observed versus when it is unknown.
Shapley values belong to a category of methods denominated additive. In particular, the additivity is formulated as
$$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i,$$
where $f(x)$ is the prediction made by the model, $x$ are the features fed to the model, $M$ is the number of features, $\phi_i$ is the Shapley value of the $i$-th feature, and $\phi_0 = \mathbb{E}[f(X)]$ is the expected value of the model over the training dataset. This assumption ensures the values correctly reflect the difference between the expected model output and the output for a particular prediction.
The Shapley value of a feature is computed via
$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \big[ f(S \cup \{i\}) - f(S) \big],$$
where $S$ is a subset of the input features, $F$ is the set of all input features, and $f(S)$ is the model evaluated on the input with only the features belonging to $S$ present.
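The weighted-subset definition can be evaluated exactly by brute force on toy models (the helper and additive toy model are illustrative; the cost is exponential in the number of features, which is why practical tools use approximations or tree-specific algorithms):

```python
from itertools import combinations
from math import factorial

def shapley_values(f, features):
    """Exact Shapley values for a set function f over feature indices.
    Brute force over all subsets: for illustration only."""
    M = len(features)
    F = list(range(M))
    phi = [0.0] * M
    for i in F:
        rest = [j for j in F if j != i]
        for k in range(M):
            for S in combinations(rest, k):
                # Weight |S|! (M - |S| - 1)! / M! for each subset S.
                w = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                phi[i] += w * (f(set(S) | {i}) - f(set(S)))
    return phi

# Toy additive model over three features: f(S) = sum of values present.
vals = [1.0, 2.0, 3.0]
f = lambda S: sum(vals[j] for j in S)
phi = shapley_values(f, vals)
# For an additive model, each feature's Shapley value is its own contribution,
# and the values sum to f(all features) - f(empty set).
```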
In this study we used the SHAP library [Lundberg2017original] and its optimisation for tree-based classifiers [Lundberg2020tree] to compute the Shapley values.
Random Forest: num of estimators=300, criterion="gini", max depth=None, min samples split=2, min samples leaf=1
Gradient Boosting: learning rate=0.3, min split loss=0, max depth=6, min child weight=1, subsample=1
LSTM: epochs=30, hidden size=128, batch size=8, num of layers=1, patience=10, dropout rate=0, learning rate=1e-4, weight decay=0.0
Feature importance for Length-of-Stay