Clinical Utility of the Automatic Phenotype Annotation in Unstructured Clinical Notes: ICU Use Cases

by Jingqing Zhang, et al.

Clinical notes contain information not present elsewhere, including drug response and symptoms, all of which are highly important when predicting key outcomes in acute care patients. We propose the automatic annotation of phenotypes from clinical notes as a method to capture essential information to predict outcomes in the Intensive Care Unit (ICU). This information is complementary to typically used vital signs and laboratory test results. We demonstrate and validate our approach conducting experiments on the prediction of in-hospital mortality, physiological decompensation and length of stay in the ICU setting for over 24,000 patients. The prediction models incorporating phenotypic information consistently outperform the baseline models leveraging only vital signs and laboratory test results. Moreover, we conduct a thorough interpretability study, showing that phenotypes provide valuable insights at the patient and cohort levels. Our approach illustrates the viability of using phenotypes to determine outcomes in the ICU.





The accumulation of healthcare data today has reached unprecedented levels: NHS datasets alone record billions of patient interactions every year [nhs-report]. In particular, due to the close monitoring of patients in an Intensive Care Unit (ICU), a wealth of data is generated for each patient [Johnson2016], with some information being recorded every minute. This information remains underexploited, which increases the pressure on caregivers. Finding efficient ways to use the data from Electronic Health Records (EHRs) is crucial to reduce the burden on hospitals and improve patient outcomes.

In the typical setting, an EHR contains two types of information: structured (e.g., blood tests, temperature, lab results) and unstructured (e.g., nursing notes, radiology reports, discharge summaries), with the latter composing the largest part (typically up to 80% [Kong2019]). Both types of information are valuable for ICU monitoring. However, the majority of recent research [harutyunyan2019multitask, Subudhi2021, Alves2019] relies on the more straightforward structured information, typically laboratory test results and vital signs. Unstructured information, on the other hand, is underexplored, and typically one of two approaches is applied: either raw clinical text is fed into a classifier for end-to-end prediction [Krishnan2018, Grnarova2016, Marafino2018, Yu2020], or the text is first used to extract categorical features which are then used for prediction. Those categorical features could be pre-extracted medical concepts: e.g., diseases [Cooley2021, Ye2020], symptoms [Dreisbach2019], and medications [Chen2020, Kim2020].

Among the unstructured data, the phenotype (in medical text, the word "phenotype" refers to deviations from normal morphology, physiology, or behavior [robinson2012deep]) has received the least attention for ICU monitoring [Cooley2021, Ye2020]. This is mainly due to the challenge of extracting phenotypic information expressed through a variety of contextual synonyms. For example, a phenotype such as Hypotension can be expressed in text as "low blood pressure". However, the phenotype is crucial for understanding disease diagnosis, identifying important disease-specific information, stratifying patients and identifying novel disease subtypes [aerts2006gene, deisseroth2019clinphen, Liu2019, son2018deep, Xu2020]. There exists a range of clinical Natural Language Processing (NLP) approaches that can extract phenotypic information from clinical text in an unsupervised manner [arbabi2019ncr, Tchechmedjiev2018, deisseroth2019clinphen]. It is thus possible to exploit this information automatically, for example, to predict mortality in septic patients [Cooley2021].
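
As a toy illustration of this contextual-synonym problem (our own sketch, not the embedding-based annotators discussed here; the phrases and the choice of HPO terms are illustrative assumptions), a naive dictionary annotator might look like this:

```python
# Toy dictionary annotator (illustrative only): maps contextual surface forms
# in clinical text to HPO terms by exact phrase matching. The systems cited
# in this work use (contextualised) embeddings rather than exact matching.
HPO_SYNONYMS = {
    "low blood pressure": ("HP:0002615", "Hypotension"),
    "high blood pressure": ("HP:0000822", "Hypertension"),
    "fast heart rate": ("HP:0001649", "Tachycardia"),
}

def annotate(note: str):
    """Return (HPO id, preferred label) pairs found in a clinical note."""
    text = note.lower()
    return [term for phrase, term in HPO_SYNONYMS.items() if phrase in text]

hits = annotate("Patient presents with low blood pressure and a fast heart rate.")
```

An exact-match annotator like this misses paraphrases such as "pressure running low", which is precisely the gap contextualised embeddings are meant to close.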

Our work thoroughly investigates the value of phenotypic information, as extracted from text, for ICU monitoring. We automatically extract mentions of phenotypes from clinical text using a self-supervised methodology built on recent advancements in clinical NLP, namely contextualised word embeddings [alsentzer-etal-2019-publicly], which are particularly helpful for the detection of contextual synonyms. We extract those mentions for over 15,000 terms of the Human Phenotype Ontology (HPO) [kohler2017human]. We enrich the phenotypic features extracted in this manner with information coming from the structured data (i.e., bedside measurements and laboratory test results) and use both types of features in a series of non-neural and neural classification algorithms. To interpret our results we use Shapley values, a methodology that comes from game theory and ranks the inputs of a system according to their importance for each individual prediction. With the help of this methodology, we are able to provide highly relevant explanations for a complete picture of a patient's status both at the individual and cohort levels.
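
The Shapley attribution idea can be illustrated with an exact computation over a toy risk model (our own sketch with made-up feature names and numbers; real analyses use an efficient approximation such as SHAP rather than brute-force enumeration):

```python
from itertools import permutations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's attribution is its marginal
    contribution to the model output, averaged over all feature orderings."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        present = set()
        for f in order:
            before = value_fn(present)
            present.add(f)
            phi[f] += value_fn(present) - before
    n_orderings = factorial(len(features))
    return {f: total / n_orderings for f, total in phi.items()}

# Hypothetical risk model over two phenotypic features (illustrative numbers):
def risk(coalition):
    score = 0.0
    if "pain" in coalition:
        score += 0.3
    if "tachycardia" in coalition:
        score += 0.2
    if {"pain", "tachycardia"} <= coalition:
        score += 0.1  # interaction: both together add extra risk
    return score

phi = shapley_values(["pain", "tachycardia"], risk)
# The interaction term is split equally between the two features, and the
# attributions sum to the full model output risk({"pain", "tachycardia"}).
```

This additivity (attributions summing to the model output) is what makes Shapley values convenient for per-patient explanations.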

We benchmark our approach on the following three mainstream ICU tasks [harutyunyan2019multitask]: length of stay, in-hospital mortality and physiological decompensation. Length of stay is one of the major drivers of operating costs and resource allocation in hospitals; understanding it in advance can be highly useful for decision-making and better resource planning. In-hospital mortality prediction is important for the early identification of at-risk patients and has long been a focus of research interest, as evidenced by the multiple scoring systems designed for the task (e.g., APACHE [Knaus1991apache], SAPS [LeGall1993saps]). Knowing which patients are more vulnerable and prone to decompensation (the physiological decompensation task) allows clinicians to monitor and track these patients more closely. Many early warning scores have been developed for this task as well (e.g., MEWS [Gardner2006mews], NEWS [Smith2019news]).

Our main contributions are: (i) an approach to incorporate phenotypic features into the modelling of ICU time-series prediction tasks; (ii) an investigation of the importance of phenotypic features, in combination with structured information, for the prediction of the patient's course at the micro (individual patient) and macro (cohort) levels; (iii) a thorough interpretability study demonstrating the importance of phenotypic features and structured information for ICU cases; (iv) a demonstration of the utility of automatic phenotyping for ICU use cases.



For our experiments, we use the publicly available ICU data from the MIMIC-III [Johnson2016mimic] database. Following [harutyunyan2019multitask], we formulate the in-hospital mortality problem as a binary classification at 48 hours after admission, in which the label indicates whether the patient dies before discharge. We formulate the problem of physiological decompensation as a binary classification, in which the target label corresponds to whether the patient will die in the next 24 hours. We cast the length of stay (LOS) prediction task as a multi-class classification problem, where the labels correspond to the remaining length of stay. Possible values are divided into 10 bins, one for the stays of less than a day, 7 bins for each day of the first week, another bin for the stays of more than a week but less than two, and the final bin for stays of more than two weeks.
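
As a sketch, the 10-bin discretisation described above can be expressed as a small helper (our own illustration, not the benchmark code):

```python
def los_bin(remaining_days: float) -> int:
    """Map remaining length of stay (in days) to the 10 classes described
    above: bin 0 for <1 day, bins 1-7 for each day of the first week,
    bin 8 for more than a week but less than two, and bin 9 for >2 weeks."""
    if remaining_days < 8:
        return int(remaining_days)  # bins 0..7
    return 8 if remaining_days < 14 else 9
```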

For data extraction, we closely follow [harutyunyan2019multitask] (the one difference being that ICU stays without available clinical notes are excluded): from 61,533 ICU stays in MIMIC-III across 46,521 patients, we exclude hospital admissions with more than one ICU stay or transfers between ICU wards. Patients younger than 18 are excluded as well. Then, all data associated with the ICU stays is processed as follows: numerical values are mapped to a unified metric system and clinical notes are annotated for HPO phenotypes. We use our contextualised approach for phenotype extraction, as well as two other (self-)supervised annotators: the Neural Concept Recognizer (NCR) [arbabi2019ncr] and ClinicalBERT [alsentzer-etal-2019-publicly]. These phenotypes are propagated forward across timesteps according to a persistency rule manually created by an expert clinician: phenotypes that would typically last an entire ICU stay (persistent) are propagated until the end, while those that could resolve earlier (transient) are propagated only until the appearance of the next clinical note. 15% of all patients are reserved for testing, while the remaining 85% are further divided for a 4-fold cross-validation. All splits are fixed and shared across all tasks and classification settings.
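
The persistency rule can be sketched as follows; this is a simplified illustration with an assumed data layout (integer timesteps, one phenotype set per note), not the actual pipeline:

```python
def propagate(note_times, annotations, persistent, n_timesteps):
    """Forward-propagate phenotypes across timesteps.

    note_times: sorted timesteps at which clinical notes were written;
    annotations: {note_time: set of phenotypes found in that note};
    persistent: set of phenotypes marked as lasting the whole ICU stay.
    Returns one set of active phenotypes per timestep."""
    active = [set() for _ in range(n_timesteps)]
    for i, t in enumerate(note_times):
        # Transient phenotypes last until the next note; persistent ones
        # until the end of the stay.
        next_note = note_times[i + 1] if i + 1 < len(note_times) else n_timesteps
        for p in annotations[t]:
            stop = n_timesteps if p in persistent else next_note
            for step in range(t, stop):
                active[step].add(p)
    return active

timeline = propagate(
    note_times=[0, 3],
    annotations={0: {"diabetes", "fever"}, 3: {"cough"}},
    persistent={"diabetes"},
    n_timesteps=6,
)
```

In this example, "diabetes" stays active for the whole stay, while "fever" (noted at timestep 0) is dropped once the next note appears at timestep 3.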

For in-hospital mortality, stays for which the LOS is shorter than 48 hours or unknown are further excluded. The number of patients, ICU episodes and timesteps per task is reported in Table 1. The mortality rate across all patients is 13.12% and the decompensation rate across all timesteps is 2.01%. The distribution of ICU stays per LOS class is presented in Table 2, and the number of unique phenotypes annotated for the cohort in Table 3. We notice that our phenotyping model detects the highest number of phenotypes in two out of three tasks, in line with its optimisation for recall. Overall, we extract on average around 1,146 (651 for ClinicalBERT, 1,381 for NCR, 1,405 for our phenotyping model), 1,207 (674, 1,471, 1,475) and 1,202 (668, 1,471, 1,460) unique phenotypes for in-hospital mortality, physiological decompensation and length of stay, respectively; about 30% of them (on average across tasks) are persistent. The most frequent phenotypes are abnormal blood oxygen level, functional respiratory abnormality, arrhythmia, abnormality of temperature regulation, abnormal cardiovascular system physiology, and increased blood pressure.

More details on the pre-processing procedure are found in the Methods section.

Task                          Split  Patients  ICU Episodes  Timesteps  Positive  Negative
Physiological Decompensation  CV-1   5125      6215          528425     10283     518142
                              CV-2   5129      6134          507892     10821     497071
                              CV-3   5141      6264          511289     10426     500863
                              CV-4   5102      6297          527853     11020     516833
                              Test   3683      4463          367533     6931      360602
In-Hospital Mortality         CV-1   2929      3382          162063     441       2941
                              CV-2   2917      3331          159566     466       2865
                              CV-3   2888      3356          160732     439       2917
                              CV-4   2936      3410          163284     477       2933
                              Test   2119      2453          117500     283       2170
Length of Stay                CV-1   5151      6245          532403     (refer to Table 2)
                              CV-2   5145      6154          510227
                              CV-3   5160      6286          514147
                              CV-4   5117      6314          530331
                              Test   3698      4483          369350
Table 1: Data distribution by splits. For the physiological decompensation and length of stay tasks, timesteps are taken as samples, as predictions are made at every hourly timestep; for the in-hospital mortality task, ICU episodes are taken as samples, as predictions are made at a fixed timestep. Here, CV refers to the training cross-validation folds, and Positive/Negative to the label counts.
Class Label  Class Description (Days)  CV-1    CV-2    CV-3    CV-4    Test
0            <1                        131913  129634  131693  133186  95439
1            1 - 2                     85311   83558   84065   85818   61372
2            2 - 3                     56353   54074   54007   54780   38858
3            3 - 4                     39416   37605   38106   38054   27142
4            4 - 5                     29384   27982   28760   28573   20171
5            5 - 6                     22830   22384   22360   22626   15878
6            6 - 7                     18816   18612   18626   18582   12940
7            7 - 8                     15925   15583   15697   15863   10953
8            8 - 14                    62655   58512   59905   60611   40856
9            >14                       69800   62283   60928   72238   45741
Total                                  532403  510227  514147  530331  369350
Table 2: Class distribution for Length of Stay
Task                          Annotator     No. of Unique Phenotypes
In-Hospital Mortality         ClinicalBERT  651
                              NCR           1381
                              Ours          1405
Physiological Decompensation  ClinicalBERT  674
                              NCR           1471
                              Ours          1475
Length of Stay                ClinicalBERT  668
                              NCR           1471
                              Ours          1460
Table 3: Numbers of unique phenotypes discovered by each annotator on the dataset. All three annotators are trained in a self-supervised fashion. Our phenotyping model and ClinicalBERT use contextualised embeddings and are further fine-tuned for phenotyping. NCR uses non-contextualised embeddings.

Quantitative Results

In general, we investigate the performance of three classifiers: Random Forest (RF), Gradient Boosting (GB), and Long Short-Term Memory (LSTM) networks. For each of them we investigate the following sets of features: structured only (S), including 17 features such as blood pressure and fraction of inspired oxygen as well as their imputation masks (discussed in the Data preprocessing section of the Methods), and structured features enriched with phenotypic features coming from one of the three annotators.

Our results are presented in Tables 4, 5, and 6; results from statistical tests are presented in Table A3. Overall, they show that phenotypic information complements the structured information and improves performance on all tasks (on average by 2.16% for in-hospital mortality, 0.53% for decompensation, and 0.83% for LOS, across annotators and classifiers). The improvements with our phenotyping model are statistically significant across tasks, with the exception of physiological decompensation with RF. Improvements from NCR are not significant for physiological decompensation with RF and GB, or for LOS with RF. ClinicalBERT did not produce significant improvements for in-hospital mortality with RF and LSTM, physiological decompensation with RF, or LOS with RF and GB. For the particular case of LOS prediction, the gains are explained by the fact that phenotypic information brings into the picture highly valuable information such as response to therapy, development of complications and unmeasured indicators of illness severity, all of which are not captured by structured information or by severity scores based on day-1 forecasts, and are fundamental to correctly estimate LOS [Kramer2017los]. For in-hospital mortality, improvements are explained by the fact that phenotypic information captures the patient's comorbidities, which are known to be essential to accurately predict mortality risk [Forte2019].

LSTMs [Hochreiter1997lstm] consistently outperform the other algorithms due to their intrinsic ability to capture temporal relations, with average increases of 1.41%, 0.52% and 1.51% for in-hospital mortality, physiological decompensation and LOS, respectively. Across all classifiers, setups with phenotypic features extracted by our phenotyping model tend to outperform the other comparable setups: by 0.66% on in-hospital mortality, 0.42% on physiological decompensation and 2.67% on LOS.

Note that clinically validated scores such as APACHE-III and SAPS-II are not competitive across settings.

Across all three tasks there is no clear winner between NCR and ClinicalBERT: each can outperform the other without any clear tendency. We also find that the mean scores from the cross-validation aggregate are comparable to the respective test set scores across all tasks and models.

While we conclude that phenotypic information indeed provides relevant information for these tasks, decision support systems in the healthcare domain should also be reliable, interpretable and robust. Therefore, we accompany the above results with a thorough interpretability study, providing explanations at both the patient and cohort levels for the observed predictions, and with an assessment of robustness by studying performance across disease-specific subcohorts.

                                          4-Fold Cross Validation Aggregate          Test Set
Classifier  Features                      AUC-ROC (Mean / SD)  AUC-PR (Mean / SD)    AUC-ROC               AUC-PR
-           SAPS-II                       0.754 / 0.006        0.322 / 0.031         0.756                 0.312
-           APACHE-III                    0.732 / 0.008        0.326 / 0.018         0.733                 0.308
RF          S                             0.809 / 0.008        0.417 / 0.018         0.800 (0.775, 0.824)  0.339 (0.286, 0.395)
RF          S + NCR                       0.818 / 0.014        0.472 / 0.013         0.828 (0.802, 0.853)  0.467 (0.404, 0.529)
RF          S + CB                        0.803 / 0.011        0.423 / 0.004         0.812 (0.787, 0.838)  0.403 (0.345, 0.463)
RF          S + Ours                      0.829 / 0.008        0.458 / 0.017         0.833 (0.809, 0.857)  0.396 (0.340, 0.456)
GB          S                             0.785 / 0.046        0.419 / 0.024         0.793 (0.766, 0.820)  0.376 (0.318, 0.434)
GB          S + NCR                       0.811 / 0.015        0.433 / 0.025         0.830 (0.805, 0.855)  0.448 (0.386, 0.510)
GB          S + CB                        0.805 / 0.014        0.392 / 0.023         0.817 (0.791, 0.842)  0.380 (0.324, 0.441)
GB          S + Ours                      0.805 / 0.005        0.394 / 0.014         0.819 (0.794, 0.843)  0.377 (0.321, 0.437)
LSTM        S [harutyunyan2019multitask]  -                    -                     0.825                 0.410
LSTM        S                             0.828 / 0.007        0.441 / 0.015         0.826 (0.801, 0.848)  0.391 (0.334, 0.452)
LSTM        S + NCR                       0.835 / 0.010        0.478 / 0.008         0.841 (0.818, 0.864)  0.453 (0.393, 0.513)
LSTM        S + CB                        0.828 / 0.007        0.459 / 0.006         0.826 (0.802, 0.849)  0.414 (0.355, 0.476)
LSTM        S + Ours                      0.844 / 0.004        0.495 / 0.013         0.845 (0.823, 0.868)  0.464 (0.405, 0.523)
Table 4: ICU In-Hospital Mortality Results with Area Under the Receiver Operating Characteristic Curve (AUC-ROC) and Area Under the Precision-Recall Curve (AUC-PR) scores. Test set scores are shown with 95% confidence intervals. The best score for each classifier is highlighted in bold. Here, S refers to Structured, NCR to Neural Concept Recognizer [arbabi2019ncr], CB to ClinicalBERT, and Ours to our phenotyping model.
                                          4-Fold Cross Validation Aggregate          Test Set
Classifier  Features                      AUC-ROC (Mean / SD)  AUC-PR (Mean / SD)    AUC-ROC               AUC-PR
RF          S                             0.815 / 0.003        0.127 / 0.009         0.826 (0.821, 0.831)  0.130 (0.123, 0.138)
RF          S + NCR                       0.820 / 0.003        0.124 / 0.007         0.825 (0.821, 0.830)  0.124 (0.118, 0.131)
RF          S + CB                        0.818 / 0.004        0.122 / 0.008         0.826 (0.821, 0.830)  0.125 (0.118, 0.132)
RF          S + Ours                      0.824 / 0.002        0.125 / 0.006         0.827 (0.823, 0.832)  0.126 (0.119, 0.134)
GB          S                             0.812 / 0.004        0.134 / 0.010         0.830 (0.825, 0.835)  0.154 (0.146, 0.163)
GB          S + NCR                       0.817 / 0.001        0.126 / 0.018         0.831 (0.827, 0.836)  0.124 (0.118, 0.131)
GB          S + CB                        0.816 / 0.004        0.122 / 0.023         0.834 (0.829, 0.838)  0.157 (0.149, 0.165)
GB          S + Ours                      0.823 / 0.005        0.131 / 0.014         0.838 (0.833, 0.842)  0.162 (0.154, 0.171)
LSTM        S [harutyunyan2019multitask]  -                    -                     0.809                 0.125
LSTM        S                             0.819 / 0.002        0.135 / 0.016         0.824 (0.819, 0.829)  0.126 (0.119, 0.133)
LSTM        S + NCR                       0.820 / 0.003        0.133 / 0.012         0.834 (0.829, 0.839)  0.134 (0.127, 0.142)
LSTM        S + CB                        0.820 / 0.005        0.128 / 0.021         0.833 (0.828, 0.838)  0.114 (0.108, 0.119)
LSTM        S + Ours                      0.833 / 0.007        0.144 / 0.022         0.839 (0.834, 0.844)  0.145 (0.138, 0.153)
Table 5: ICU Decompensation Results with Area Under the Curve of Receiver Operating Characteristic (AUC-ROC) and Area Under the Curve of Precision-Recall (AUC-PR) Scores. Test set scores are shown with 95% confidence intervals. The best score for each classifier is highlighted in bold. Here, S refers to Structured, NCR to Neural Concept Recognizer[arbabi2019ncr], CB to ClinicalBERT, and Ours to our phenotyping model.
                                          4-Fold Cross Validation Aggregate        Test Set
Classifier  Features                      Kappa (Mean / SD)   MAD (Mean / SD)      Kappa                 MAD
RF          S                             0.381 / 0.004       142.0 / 4.6          0.390 (0.388, 0.392)  136.8 (136.2, 137.4)
RF          S + NCR                       0.382 / 0.007       148.0 / 4.1          0.390 (0.388, 0.392)  142.5 (141.9, 143.1)
RF          S + CB                        0.369 / 0.005       149.2 / 3.7          0.376 (0.374, 0.379)  144.3 (143.7, 144.9)
RF          S + Ours                      0.395 / 0.006       146.2 / 4.6          0.405 (0.403, 0.407)  140.6 (140.0, 141.2)
GB          S                             0.387 / 0.001       155.9 / 2.1          0.399 (0.397, 0.401)  151.8 (151.1, 152.4)
GB          S + NCR                       0.366 / 0.005       150.2 / 3.0          0.381 (0.378, 0.383)  146.7 (146.1, 147.3)
GB          S + CB                        0.373 / 0.005       148.8 / 2.7          0.392 (0.390, 0.394)  143.5 (143.0, 144.1)
GB          S + Ours                      0.401 / 0.003       142.0 / 3.7          0.412 (0.410, 0.414)  138.9 (138.3, 139.5)
LSTM        S [harutyunyan2019multitask]  -                   -                    0.395                 126.7
LSTM        S                             0.375 / 0.003       134.3 / 17.2         0.380 (0.377, 0.382)  157.0 (156.3, 157.6)
LSTM        S + NCR                       0.392 / 0.013       127.1 / 17.4         0.406 (0.404, 0.408)  123.3 (122.8, 123.9)
LSTM        S + CB                        0.374 / 0.015       127.6 / 8.6          0.388 (0.386, 0.390)  120.1 (119.6, 120.6)
LSTM        S + Ours                      0.416 / 0.011       116.1 / 6.9          0.430 (0.427, 0.432)  116.7 (116.1, 117.2)
Table 6: ICU Length of Stay Results with Cohen’s Kappa and Mean Absolute Deviation (MAD) Scores. Test set scores are shown with 95% confidence intervals. The best score for each classifier is highlighted in bold. Here, S refers to Structured, NCR to Neural Concept Recognizer[arbabi2019ncr], CB to ClinicalBERT, and Ours to our phenotyping model.
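
For reference, the two LOS metrics can be computed as below (a minimal sketch of our own; note that the benchmark of [harutyunyan2019multitask] reports a linearly weighted kappa, whereas this sketch shows the unweighted variant for simplicity):

```python
from collections import Counter

def cohens_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa: observed agreement corrected for the
    agreement expected by chance from the marginal class frequencies."""
    n = len(y_true)
    p_observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_chance = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

def mad(y_true_hours, y_pred_hours):
    """Mean absolute deviation between true and predicted remaining LOS."""
    n = len(y_true_hours)
    return sum(abs(t - p) for t, p in zip(y_true_hours, y_pred_hours)) / n
```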


In this paper we work on three popular tasks in the ICU setting. Our work shows that these tasks benefit from the inclusion of the phenotypic information found in clinical notes, irrespective of the classification algorithm. Further, we show that our algorithm outperforms state-of-the-art models. In the following section, we illustrate how phenotypic information also improves reliability and provides clinicians with valuable explanations for the predictions.

Phenotype persistency

During modelling we have found it beneficial to separate extracted phenotypes into persistent and transient, and then to propagate them forwards in time. Each phenotype was marked by one human clinical expert based on whether it would typically persist throughout an entire ICU stay or not. Consequently, transient (e.g., fever, cough, dyspnea) and persistent (e.g., diabetes, cancer) phenotypes are propagated until the appearance of a new clinical note or the end of the ICU stay, respectively. To investigate the value of this procedure more closely, we have performed an ablation study and assessed the contribution of our phenotypic propagation methodology. Results of our experiments are in Table 7.

They show that phenotypic propagation is particularly beneficial for models such as RF and GB. Given that the intrinsic properties of those models do not allow for temporal modelling, the propagation of phenotypic information is essential: without it, performance is comparable to that of models trained purely on structured data.

Furthermore, the LSTM model for in-hospital mortality using phenotypes without propagation also performed worse than its counterpart with propagation, but better than the purely structured one. Considering the short 48-hour window within which information is collected, this seems to indicate that our knowledge-based propagation method is more useful than the temporal relationships the LSTM network can learn in such a reduced time span.

Finally, the LSTM models for physiological decompensation and LOS without phenotypic propagation produced statistically superior results to those of the networks trained with such propagation. Since these two tasks exploit the entire time course of the ICU stay, LSTM models can better capture temporal relationships by themselves, given the larger amount of data to learn from. While all of our transient phenotypes are propagated until the next clinical note appears, the LSTM can learn per-phenotype rules by itself, possibly propagating some phenotypes beyond the next clinical note while discarding others immediately after their appearance. We theorise that this finer granularity of phenotypic propagation produces the observed improvements.

We conclude that these findings support propagating phenotypic information as modelled here, and that further work focused on learning individual persistencies for phenotypes would be highly beneficial, not only to boost performance, but also to provide insights into the temporal duration of phenotypes in the ICU.

                         4-fold Cross Validation Aggregate          Test Set
Classifier  Propagation  AUC-ROC (Mean / SD)  AUC-PR (Mean / SD)    AUC-ROC               AUC-PR
RF          without      0.807 / 0.008        0.413 / 0.021         0.799 (0.772, 0.824)  0.351 (0.297, 0.407)
RF          with         0.829 / 0.008        0.458 / 0.017         0.833 (0.809, 0.857)  0.396 (0.340, 0.456)
GB          without      0.804 / 0.010        0.417 / 0.024         0.796 (0.768, 0.822)  0.390 (0.332, 0.355)
GB          with         0.805 / 0.005        0.394 / 0.014         0.819 (0.794, 0.843)  0.377 (0.321, 0.437)
LSTM        without      0.833 / 0.014        0.457 / 0.024         0.831 (0.807, 0.853)  0.421 (0.361, 0.483)
LSTM        with         0.844 / 0.004        0.495 / 0.013         0.845 (0.823, 0.868)  0.464 (0.405, 0.523)
(a) In-hospital mortality
                         4-fold Cross Validation Aggregate          Test Set
Classifier  Propagation  AUC-ROC (Mean / SD)  AUC-PR (Mean / SD)    AUC-ROC               AUC-PR
RF          without      0.812 / 0.002        0.125 / 0.007         0.820 (0.815, 0.825)  0.127 (0.120, 0.135)
RF          with         0.824 / 0.002        0.125 / 0.006         0.827 (0.823, 0.832)  0.126 (0.119, 0.134)
GB          without      0.814 / 0.005        0.142 / 0.013         0.830 (0.825, 0.835)  0.151 (0.143, 0.160)
GB          with         0.823 / 0.005        0.131 / 0.014         0.838 (0.833, 0.842)  0.162 (0.154, 0.171)
LSTM        without      0.827 / 0.007        0.146 / 0.017         0.841 (0.842, 0.851)  0.149 (0.141, 0.156)
LSTM        with         0.833 / 0.007        0.144 / 0.022         0.839 (0.834, 0.844)  0.145 (0.138, 0.153)
(b) Physiological decompensation
                         4-fold Cross Validation Aggregate          Test Set
Classifier  Propagation  Kappa (Mean / SD)    MAD (Mean / SD)       Kappa                 MAD
RF          without      0.376 / 0.005        139.8 / 5.5           0.386 (0.380, 0.384)  135.0 (134.5, 135.6)
RF          with         0.395 / 0.006        146.3 / 4.6           0.405 (0.403, 0.407)  140.7 (140.1, 141.3)
GB          without      0.384 / 0.001        155.9 / 2.1           0.391 (0.390, 0.395)  150.0 (149.4, 150.6)
GB          with         0.401 / 0.003        142.1 / 3.7           0.412 (0.410, 0.414)  138.9 (138.3, 139.5)
LSTM        without      0.427 / 0.007        118.3 / 4.2           0.441 (0.439, 0.440)  111.4 (110.9, 111.9)
LSTM        with         0.416 / 0.012        116.2 / 6.9           0.430 (0.427, 0.432)  116.7 (116.2, 117.3)
(c) Length of Stay
Table 7: Results of the ablation study on our phenotyping model assessing the importance of phenotype propagation. Models without phenotypic propagation encounter high sparsity of phenotypes, as data is only available at the timestep at which a clinical note is written; models with phenotypic propagation observe phenotypes throughout all timesteps. The best score for each classifier is highlighted in bold.

Phenotype importance

To further understand the contribution of phenotypic features to the prediction performance, we have studied the most important features with the help of Shapley values [Lundberg2017original]. This analysis, and all others involving Shapley values, is done on the Gradient Boosting (GB) models. An illustration of our investigation is in Figure 1, where we present the top predictive features for in-hospital mortality and physiological decompensation. It confirms our earlier observations from the automatic metrics that phenotypic features are particularly helpful for in-hospital mortality prediction, given that 14 out of the 20 most important features are phenotypes. This aligns with our expectations and is explained by the fact that this task has no fixed point in the future at which to make predictions. In other words, since the prediction should cover the entire stay rather than predict whether the patient will die within the next 12 hours, forecasts need to rely on data that is highly informative but also able to provide accurate insights into the long-term future. Phenotypes, contrary to bedside measurements, which can lose temporal correlation due to their dynamic nature, fulfil these criteria very well: they can capture, for instance, comorbidities, which are essential for predicting mortality [Forte2019] and constitute the entire input of clinically validated models for mortality prediction [Charlson1987, Elixhauser1998]. Furthermore, another study including 230,000 ICU patients found that combining comorbidities with acute physiological measurements yielded the best results, outperforming all mortality scores (APACHE II, SAPS II) [Nielsen2019].

The top-ranking feature for mortality prediction is, unexpectedly, whether the patient experiences pain or not. Although not decisive, there is some initial evidence corroborating that pain management improves outcomes in the ICU [Georgiou2015, Hasegawa2017]. However, pain could also be interpreted as a proxy for a high level of consciousness, which has been correlated with better outcomes in the ICU [Bastos1993]. Other top-ranking phenotypes included tachycardia, atrial arrhythmia, constitutional symptom, nausea and vomiting, abnormal pupillary function, abnormal pattern of respiration, abnormal prothrombin time, confusion, increased inflammatory response, abnormal respiratory system morphology, diarrhea, and dyspnea, which together cover most of the body systems (i.e., heart, lungs, GI tract, central nervous system, coagulation, infection, kidneys) typically assessed through clinically validated scores, including early warning scores (e.g., NEWS, MEWS) and mortality scores (e.g., Charlson Comorbidity Index [Charlson1987], Elixhauser Index [Elixhauser1998], APACHE, SAPS).

(a) Physiological Decompensation
(b) In-Hospital Mortality
Figure 1: Top features for (a) physiological decompensation and (b) in-hospital mortality. Features are sorted in decreasing importance according to their mean absolute Shapley values. Each row presents a condensed summary of the feature's impact on the prediction. Each data sample is represented as a single dot in each row, and its colour on a particular row represents the value of that sample for that feature, with blue corresponding to lower values or absence, and red to higher values or presence. The SHAP value (horizontal position of a dot) measures the contribution of that feature for a sample towards the prediction (right corresponding to mortality or decompensation, and left to survival or absence of decompensation risk). For instance, in the top row of (b), since the vertical axis clearly splits patients by colour, manifesting HP:0012531 Pain consistently leads to lower chances of dying.

Our study also showed that while phenotypic features are not as important for decompensation as for in-hospital mortality, they are still useful because they provide a better estimation of the predicted risk. Given that this task concerns predicting mortality within the next 24 hours, bedside measurements become more informative thanks to their temporal correlation; accordingly, only 4 of the top 20 features for this task were phenotypic. Nevertheless, bedside measurements can be ambiguous or provide an incomplete picture of the patient's status without the data found in clinical notes, so the notes can provide a baseline on which the prediction rests. This situation is illustrated by Figure 2, where the top-ranking feature is a phenotype. Even though this phenotype is marked as persistent and thus remains present throughout the ICU stay, it appropriately increases the estimated risk of decompensation, giving overall a better estimation. This is further confirmed by the calibration plot for decompensation (Figure 4(a)), evidencing that the inclusion of phenotypic information leads to consistently better-calibrated estimates, and by Figure 3(a), where it is observed that predictions for physiological decompensation benefit the most when the patient stays in the ICU for more than one week.

Similarly, the top features for length-of-stay are presented in Figure A2.

As a next step, we investigated the calibration curves to assess how phenotypic information contributes to the reliability of the models.

Figure 2: Time course of the physiological decompensation prediction for an illustrative patient in the test set. The top plot represents the time series of the prediction as a probability (0 for no risk of decompensation, 1 for decompensation). The heatmap illustrates how the contribution of each feature (i.e., each row) varies across time for this subject. Features are sorted in decreasing order according to their importance for this patient, represented by the black horizontal bar at the right of each row. The colour of a row indicates how that feature contributes to the prediction at a moment in time, with red representing a positive contribution (i.e., that the patient will decompensate), and blue a negative one. For this patient, although fluctuations in the prediction come from changes in structured data, taking into account the neoplasm of the respiratory system allows to better estimate the baseline probability of a long stay.

(a) Physiological Decompensation
(b) In-hospital Mortality
Figure 3: AUC-ROC for (a) physiological decompensation and (b) in-hospital mortality with the LSTM, for patients with different LOS values. While the in-hospital mortality task benefits consistently for any duration of the ICU stay, decompensation sees the best improvements when patients stay the longest. This behaviour is a natural consequence of the fact that while near-future forecasts can rely strongly on bedside measurements, forecasting without a fixed endpoint in time is significantly more difficult. Nevertheless, patients who stayed for more than two weeks still saw a benefit when introducing phenotypic features, as they better calibrate the algorithm’s predictions. Here, S represents Structured and Ours refers to our phenotyping model.


Our investigation of the respective calibration curves (for LSTMs, the results are in Figure 4) shows that unstructured information improves model calibration, i.e., the agreement between predicted probabilities and observed outcome frequencies. To summarise each curve, we use the Brier score [Brier1950], computed here as the mean squared error between the observed calibration curve and the ideal one (i.e., the identity line).
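As a concrete illustration, a minimal NumPy sketch of a calibration curve and of the curve-based summary described above (the bin count and equal-width binning are our assumptions; the classical Brier score is noted in a comment):

```python
import numpy as np

def calibration_curve(y_true, y_prob, n_bins=10):
    """Group predictions into equal-width probability bins and return, for
    each non-empty bin, the mean predicted probability and the observed
    fraction of positives."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ids = np.digitize(y_prob, edges[1:-1])
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = ids == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())
            frac_pos.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(frac_pos)

def curve_mse(y_true, y_prob, n_bins=10):
    """Curve-based summary: mean squared deviation of the observed
    calibration curve from the identity line. (The classical Brier score
    is instead np.mean((y_prob - y_true) ** 2).)"""
    mean_pred, frac_pos = calibration_curve(y_true, y_prob, n_bins)
    return float(np.mean((frac_pos - mean_pred) ** 2))
```

A perfectly calibrated model lies on the identity line and scores 0.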

Our major observation is that phenotypic features improve calibration across setups; this is especially true for the features extracted with our methodology, for both physiological decompensation and in-hospital mortality. LSTMs with our features produce the lowest Brier scores.

We have also noticed that physiological decompensation models are overall better calibrated, which is expected given that physiological decompensation has a near-future prediction scope (i.e., the next 24 hours), whereas in-hospital mortality has no fixed time window.

Note that APACHE and SAPS (clinically validated mortality scores) perform worse than, or only comparably to, our weakest setting (i.e., structured data only).

Results for the other classifiers are presented in Figures A3 and A4. Overall, GBs exhibit similar calibration results for physiological decompensation and in-hospital mortality, and calibration with RFs for in-hospital mortality is also comparable. However, RFs for physiological decompensation are noticeably worse calibrated.

(a) Physiological Decompensation
(b) In-hospital Mortality
Figure 4:

Calibration curves for (a) physiological decompensation and (b) in-hospital mortality. Even when a system can accurately predict a label after thresholding in a classification problem, knowing the probability with which the system assigned a class provides further insight and interpretation; calibration plots illustrate how interpretable such probabilities are. Each curve is presented with its Brier score (the lower the better). Note that, overall, the inclusion of unstructured data helps with calibration. Further, for both tasks, unstructured data annotated by our phenotyping model provides the best results. LSTM: structured features with a Long Short-Term Memory neural network. Ours, NCR, CB: phenotypic features from our phenotyping model, NCR and ClinicalBERT, respectively.

Prognosis analysis

Beyond producing clinically relevant explanations at the cohort level, with the help of Shapley values we can shed light on a patient’s journey and discover retrospectively when the patient was most vulnerable and why. For example, the fragment of a patient’s LOS forecast in Figure 5 shows an estimated probability of a LOS longer than 14 days of 69% after the 41st hour from admission, mainly because the patient scored 1 on the Glasgow Coma Scale Verbal Response. One hour later, when a clinical note becomes available, worrisome phenotypes appear (including edema, hypotension and abnormality of the respiratory system), and the estimated probability increases to 88%. While hypotension is a concept the classifier can learn from blood pressure measurements, edema and abnormality of the respiratory system are not reflected in typical acute measurements, all of which indicates the usefulness of phenotypic information in these tasks.

Figure 5: Illustrative case for an ICU length of stay of more than 14 days. Top plot: time course of the normalised predicted probability of a stay of 14+ days, and feature heatmap for a representative segment of the ICU stay. Each row of the heatmap represents one of the top features. At each time step, a feature can contribute positively (red) or negatively (blue) to predicting a stay of 14 days or more. Black horizontal bars at the right of each row represent the importance of the features. Note that when a new clinical note becomes available (at the 42nd hour since admission, vertical dashed line), new relevant information also becomes available and the confidence of predicting a longer stay grows congruently (observed as an abrupt increase in the predicted probability). Bottom plot: inspection of the contributing features at the 41st and 42nd hours. Each force plot illustrates features contributing positively (in red) and negatively (in blue) to the prediction of a longer stay. The probability of a long stay increases from 69% to 88% when a clinical note is written, given that it provides critical information for the decision-making process (including Edema, Hypotension and Abnormality of the respiratory system).

Cohort study

Finally, we measure the generalisability of the systems to different cohorts of patients. We assess each system’s performance when tested on specific patient cohorts: Diabetes, Cancer, Cardiovascular Diseases, and Depression. We split our test set into these disease-specific cohorts and compute the scores using our best classifiers, based on an LSTM trained with both structured and unstructured data obtained with our phenotype extraction methodology. Results for all tasks are shown in Table 8. They show that our systems are robust to different cohorts of patients, including under-represented ones.

On the physiological decompensation task, all the cohorts show similar performance, though all the scores are lower than on the complete test set; our model thus appears to generalise well irrespective of the cohort. Similarly, on the in-hospital mortality task, the model generalises to all the cohorts, with Cardiovascular Diseases and Depression obtaining the lowest scores.

In contrast, on the LOS task, the model generalises well to the largest cohorts, i.e., Cardiovascular Diseases and Diabetes, but the smallest cohorts, i.e., Cancer and Depression, show poor generalisability. It seems that for the physiological decompensation and in-hospital mortality tasks, the nature of the diseases affects performance more than the sample size of the test set, whereas for the LOS task the smallest cohorts lag in performance, which may be related to their sample size.

Cohort No. of Patients No. of ICU Episodes Kappa
All 3698 4483 0.430
Cardiovascular Diseases 980 1202 0.413
Diabetes 930 1195 0.424
Cancer 493 572 0.321
Depression 216 241 0.330
(a) Length of Stay prediction using Cohen’s Kappa scores on different cohorts.
Cohort No. of Patients No. of ICU Episodes AUC-ROC
All 2119 2453 0.845
Cardiovascular Diseases 681 789 0.780
Diabetes 563 682 0.826
Cancer 277 304 0.822
Depression 119 122 0.783
(b) In-hospital Mortality prediction using Area Under the Curve of Receiver Operating Characteristic (AUC-ROC) scores on different cohorts.
Cohort No. of Patients No. of ICU Episodes AUC-ROC
All 3683 4463 0.839
Cardiovascular Diseases 975 1197 0.792
Diabetes 927 1191 0.808
Cancer 489 565 0.806
Depression 216 240 0.820
(c) Physiological decompensation prediction using Area Under the Curve of Receiver Operating Characteristic (AUC-ROC) scores on different cohorts.
Table 8: Cohort study: analysing the generalisability of our best model (an LSTM using both structured and unstructured data extracted with our phenotype extraction methodology) on different cohorts of patients.


There are several directions in which to take our work further. We have investigated only one data source, and our observations are to be confirmed on other data sources.

Further, due to technical limitations, our explanations were produced with the Gradient Boosting model, which, while still superior to the baseline built with structured data only, is not our strongest model. Explanations from neural-network-based models are to be investigated in future studies.

Finally, there are certain limitations to our modelling approach. While we introduce a new modelling approach for phenotypic features, the ones annotated as transient are kept present only until a new clinical note appears in the timeline. This has the inconvenience that a phenotype might be prematurely considered absent because the next available clinical note did not mention it. Even though the LSTM classifier is able to learn temporal correlations on its own, more elaborate feature modelling could prove useful.


Data preprocessing

We extracted the data following the practices of the benchmark [harutyunyan2019multitask]. We follow their filtering criteria for patients, admissions and ICU stays for all three tasks. For in-hospital mortality, stays for which the LOS is shorter than 48 hours or unknown are further excluded. In addition to the patient filtering criteria, we discard all ICU episodes in which no clinical note is recorded. This reduces our train and test data compared to the benchmark, so we recalculate the baseline scores using their code on our new test set; these scores are reported in the first row of the LSTM results in Tables 4, 5, and 6.

We follow the preprocessing of the benchmark for the structured data. We impute all missing values for the structured features with the most recent observation; in case it is not available, we use the normal reference value. We also create masks to differentiate imputed values from observed ones. We group multiple observations into hourly timesteps by taking the most recent value, and add the missing timesteps with the same imputation strategy. Then, we scale all the features to zero mean and unit variance.
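A minimal pandas sketch of this imputation pipeline (the variable names and normal reference values are illustrative, and scaling is omitted; the benchmark’s actual code differs):

```python
import numpy as np
import pandas as pd

# Assumed normal reference values for two illustrative variables.
NORMAL = {"heart_rate": 80.0, "glucose": 100.0}

def preprocess(df):
    """Hourly grid with last-observation-carried-forward imputation,
    a normal-value fallback, and a 0/1 observed-vs-imputed mask.
    `df` has a DatetimeIndex of raw chart-event timestamps."""
    hourly = df.resample("1h").last()    # most recent value per hour
    mask = hourly.notna().astype(int)    # 1 = actually observed this hour
    hourly = hourly.ffill().fillna(value=NORMAL)
    return hourly, mask
```

Hours with no observation at all inherit the last seen value, and leading gaps fall back to the normal reference, exactly as described above.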

For the unstructured data, we face high sparsity, as clinical notes are recorded infrequently: roughly every 12 hours, against structured data which are recorded far more often. We devise an imputation strategy to handle this issue. We divide the phenotypes into persistent and transient based on the annotations of an expert clinician. If a phenotype was clinically deemed likely to last an entire admission in the vast majority of typical cases, it was marked as ‘persistent’. These phenotypes are generally chronic in nature and reflect a patient’s medical background (such as cancer and tuberculosis); such conditions are unlikely to change throughout the patient’s ICU stay. All temporary phenotypes, and those that generally last long but might resolve in some cases, are marked ‘transient’ to prevent a resolved phenotype from being carried forward too far; these can be acquired or improve during an ICU stay (such as pain, fever and cough). In practice, this means that a persistent phenotype is made available throughout the entire stay from the point it first appears, whereas a transient phenotype is made present from the moment it appears until a new clinical note appears. Finally, we replace all phenotypes with their immediate parent in the Human Phenotype Ontology (HPO) [Kohler2021hpo] tree (unless the parent is “Phenotypic abnormality”, the root node of phenotypes in HPO) to reduce the overall feature dimensionality. Overall, this produces a reduction of 40-60% in the number of features.
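The carry-forward scheme can be sketched as follows (the data layout, with notes keyed by hour, is our assumption):

```python
def expand_phenotypes(notes, n_hours):
    """Per-hour active phenotype sets: persistent phenotypes stay on from
    their first mention; transient ones last only until the next note.
    `notes` maps note hour -> {phenotype: "persistent" | "transient"}."""
    active = []
    persistent, transient = set(), set()
    for hour in range(n_hours):
        if hour in notes:
            transient = set()  # a new note resets transient phenotypes
            for pheno, kind in notes[hour].items():
                (persistent if kind == "persistent" else transient).add(pheno)
        active.append(persistent | transient)
    return active
```

For example, a persistent neoplasm annotated in the first note remains active for the whole stay, while a transient fever disappears as soon as a later note no longer mentions it.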

Algorithm development and analysis

To extract phenotypes from free-text clinical notes, we use our phenotyping model: a BERT-based system trained first in a self-supervised fashion and then fine-tuned on manually annotated data, with a focus on detecting contextual synonyms and increasing recall. Additionally, we use ClinicalBERT [alsentzer-etal-2019-publicly] and NCR [arbabi2019ncr] for comparison. NCR uses a convolutional neural network (CNN) to assign similarity scores to HPO concepts for phrases encoded using pre-trained non-contextualised word embeddings. Our phenotyping model, instead, uses a BERT-based network to assign similarity scores between contextualised word embeddings and pre-trained embeddings from a knowledge graph based on the HPO hierarchy. Our phenotyping model is further described in the Appendix.

Along with our phenotyping model, we use standard machine learning classifiers: Random Forest [Breiman2001], Gradient Boosting [Chen2016xgboost], and LSTM [Hochreiter1997lstm]. We do not apply any explicit feature engineering besides distinguishing the phenotypic features into persistent and transient, for any of the models.

Evaluation Metrics

We use the Area Under the Receiver Operating Characteristic Curve [Lasko2005aucroc] (AUC-ROC) and the Area Under the Precision-Recall Curve (AUC-PR) for the in-hospital mortality and physiological decompensation tasks. We rely primarily on AUC-ROC for statistical analysis, as it is threshold-independent and used by the benchmark [harutyunyan2019multitask] as the primary metric. For the length-of-stay task, we use Cohen’s Kappa [Cohen1960kappa] and the Mean Absolute Deviation [PhamGia2001mad] (MAD), relying primarily on the Kappa scores for statistical analysis.
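For reference, the two LOS metrics can be sketched with NumPy (an unweighted Kappa over discretised LOS buckets; the benchmark’s Kappa variant may weight disagreements differently):

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    """Agreement beyond chance between true and predicted LOS buckets."""
    conf = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        conf[t, p] += 1
    n = conf.sum()
    po = np.trace(conf) / n                        # observed agreement
    pe = (conf.sum(0) * conf.sum(1)).sum() / n**2  # chance agreement
    return (po - pe) / (1 - pe)

def mad(y_true, y_pred):
    """Mean absolute deviation of the predicted LOS from the truth."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))
```

Perfect bucket agreement yields a Kappa of 1, while a constant predictor matches chance and scores 0.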

Explanation of predictions

To provide insights into our predictions we used Shapley values [strumbelj2014shapley], which are explained in more detail in the Appendix. These values are typically used to explain black-box models and allow us to quantify the importance of a feature and whether it impacts the outcome positively or negatively.

Model evaluation and statistical analysis

We use a train-test split based on the benchmark [harutyunyan2019multitask], but exclude patients without clinical notes, resulting in 21,346 and 3,824 patients for the train and test sets, respectively. Further, we perform 4-fold cross-validation on the training set (5,344, 5,324, 5,357 and 5,321 patients per fold). All splits are deterministic, so all classifiers with different data settings are trained and evaluated on the same subsets of data. Following the benchmark, we use bootstrap resampling for the statistical analysis of the scores: to compute confidence intervals on the test set, we resample it 1,000 times for the length-of-stay and decompensation tasks and 10,000 times for the in-hospital mortality task, then compute the scores on the resampled data to calculate 95% confidence intervals.
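The bootstrap procedure can be sketched as follows (the resample count, seed and percentile method are illustrative):

```python
import numpy as np

def bootstrap_ci(y_true, y_score, metric, n_resamples=1000, seed=0):
    """Percentile bootstrap: resample the test set with replacement,
    recompute the metric each time, and report the 95% interval."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, n)            # sample indices w/ replacement
        scores.append(metric(y_true[idx], y_score[idx]))
    lo, hi = np.percentile(scores, [2.5, 97.5])
    return float(lo), float(hi)
```

Any score function with a `(y_true, y_score)` signature (e.g. AUC-ROC or Kappa) can be plugged in as `metric`.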

Data availability

The MIMIC-III database is available on the PhysioNet repository [physionet2020mimiciii]. The code to generate the three ICU benchmarks from MIMIC-III is published by the benchmark study [harutyunyan2019multitask].



We would like to thank Dr. Rick Sax, Dr. Garima Gupta and Dr. Matt Wiener for their feedback throughout this research. We would also like to thank Dr. Garima Gupta, Dr. Deepa (M.R.S.H) and Dr. Ashok (M.S.) for helping us create gold-standard phenotype annotation data.

Author contributions

JZ, LB, AT and JI conceived the experiments. LB and AT conducted the experiments. JZ, LB, AT, AS and JI analysed the results. JI, VB and YG reviewed the research and manuscript. All authors approved the manuscript.

Additional information

Competing interests: Vibhor Gupta and Yike Guo are founders of Pangaea Data Limited. Julia Ive is a consulting advisor to Pangaea Data Limited.


Our self-supervised contextualised phenotyping model

We represent tokens using contextualised embeddings from the state-of-the-art pre-trained ClinicalBERT [alsentzer-etal-2019-publicly]. Those embeddings are pre-trained on the MIMIC-III data, on top of the BERT [devlin-etal-2019-bert] and BioBERT [biobert] embeddings, which are in turn pre-trained on text corpora in the general and biomedical domains, respectively. BERT is based on the Transformer architecture [vasvani:nips:2018], the current state of the art in language understanding and text generation, which allows modelling long-range contextual dependencies (dependencies between words separated by long unrelated text spans). In addition to the adoption of the Transformer architecture, BERT’s novelty comes from its pre-training objective, so-called masked language modelling (MLM). This pre-training is fully unsupervised (or rather self-supervised) and consists in predicting masked words in text by relying on their left and right contexts.

As shown in Figure A1 (Left), our phenotyping model consists of a Transformer Encoder which is identical to, and initialised by, ClinicalBERT. In addition, two classifiers on top of the Transformer Encoder predict whether a token is HPO-relevant and assign HPO concepts to the HPO-relevant tokens.

Figure A1: Left: The phenotyping model consists of one Transformer Encoder and two additional classifiers. The model is trained using the three self-supervised objectives (1, 2 and 3). Middle and right: Prior to the training of the phenotyping model, the other model for HPO embeddings is trained with relational and semantic losses (4, 5). Both Transformer Encoders are initialised with Clinical BERT but trained separately.

We enrich the model with the following three training objectives (notation: $x$ denotes a token, $y$ its binary HPO-relevance label, and $c$ an HPO concept):

(1) A standard binary cross-entropy loss objective to predict $p(y \mid x)$, where $x$ represents a token in text and $y$ is 1 if $x$ is HPO-relevant, otherwise 0.

(2) A standard cross-entropy loss objective with softmax to predict $p(c \mid x)$, where $c$ is defined over the most frequent 400 HPO concepts found in the training data.

(3) A Euclidean-distance objective between the embedding of $x$ and the respective HPO concept embedding if $c$ is an infrequent HPO concept.

The intuition behind objective (2) is to increase precision in the resulting performance, while objective (3) is defined to increase recall and targets the detection of rare and granular HPO concepts. Good performance can only be reached by using both objectives; otherwise, recall, for example, would suffer.
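Purely for intuition, a per-token NumPy sketch of the three objectives (all names, shapes and the way the terms are combined are our assumptions; the actual model is trained end-to-end):

```python
import numpy as np

def combined_loss(relevance_logit, is_relevant,
                  concept_probs, concept_id,
                  token_emb, hpo_emb, is_rare):
    """Sketch of the three per-token objectives:
    (1) binary cross-entropy for HPO relevance,
    (2) cross-entropy over the frequent-concept softmax,
    (3) Euclidean distance to the HPO embedding for rare concepts."""
    p = 1.0 / (1.0 + np.exp(-relevance_logit))            # sigmoid
    l1 = -(is_relevant * np.log(p) + (1 - is_relevant) * np.log(1 - p))
    l2 = -np.log(concept_probs[concept_id])               # softmax CE
    l3 = np.linalg.norm(token_emb - hpo_emb) if is_rare else 0.0
    return l1 + l2 + l3
```

In training, (2) would apply to tokens matched to frequent concepts and (3) to rare ones; the sketch simply adds whichever terms are active.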

Additionally we follow the pre-training objective masked language model (MLM) by BERT [devlin-etal-2019-bert], which helps to maintain the semantics of actual words in the created representations.

Prior to the training of the proposed model, we also build the knowledge graph embeddings for HPO concepts based on the Transformer architecture. The model is designed to encode both the hierarchical connections between HPO concepts and the semantics of their definitions, so that similar HPO concepts have similar embeddings and dissimilar ones do not. We consider two learning objectives while training this model. The first encourages neighbouring HPO concepts to have similar embeddings and non-neighbouring HPO concepts to have different embeddings.


where $c'$ is a neighbouring HPO concept of $c$ and $\mathcal{N}(c)$ is a set of non-neighbouring HPO concepts of $c$. Second, we adopt the skip-gram negative sampling model [mikolov:nips:2013] to predict the appearance of tokens in the definitions of HPO concepts.


In practice, this Transformer Encoder is also identical to, and initialised by, ClinicalBERT. The HPO embeddings are trained and obtained prior to the training of the proposed self-supervised model.

Training Data

In the self-supervised pre-training stage, we use 38,772 notes of brief hospital course from MIMIC-III’s discharge summaries and 1.5M notes generated by data augmentation (see details below), with the generative model also trained on MIMIC-III. We build our training examples without any additional human annotation: HPO names, synonyms and abbreviations, as well as the respective Unified Medical Language System (UMLS) [Bodenreider2004] concepts that the ontology relates to an HPO concept, are exact-matched in the clinical text, and the matches serve as training samples for our proposed self-supervised model.

In the fine-tuning stage, we collect gold-standard phenotype annotation data with the help of three expert clinicians and fine-tune the pre-trained phenotyping model on the gold data. The EHRs were pre-annotated with HPO concepts by keyword matching, and the annotations were then corrected by the three clinicians with consensus. We have created our own sub-corpus of 242 discharge summaries from MIMIC-III with gold annotations, using 146, 48 and 48 of them for fine-tuning, validation and testing, respectively.


At the inference stage of phenotype annotation, we hypothesise that the classifier can detect the HPO-relevant spans of frequent HPO concepts with high precision, while spans of rare HPO concepts can be recovered, with good recall, via the Euclidean distance between the contextualised token embedding and the HPO embedding. More precisely, we formalise the decision strategy as Algorithm 1.

Let $X$ be the input token sequence; let $C$ stand for the full set of HPO concepts, with $C_f$ the subset of most frequent HPO concepts and a special notation representing rare HPO concepts. Initialise the thresholds for the classifier probability and for the Euclidean distance with pre-defined values. Then, for each token $x$ in $X$: if the classifier assigns $x$ a frequent HPO concept with probability above the first threshold, keep that concept; otherwise, assign the rare HPO concept whose embedding lies closest to the contextualised embedding of $x$, provided the Euclidean distance falls below the second threshold. Return the set of resulting annotations.
Algorithm 1 The decision strategy of inferring phenotype annotation.
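A hedged Python sketch of this inference strategy (the thresholds, array layouts, and the special `"OTHER"` label standing in for rare concepts are our assumptions, not the paper’s exact formulation):

```python
import numpy as np

def annotate(tokens, rel_probs, concept_probs, token_embs, rare_embs,
             freq_concepts, rare_concepts, tau_prob=0.5, tau_dist=1.0):
    """Frequent concepts via the classifier above a probability threshold;
    rare concepts via Euclidean distance to pre-trained HPO embeddings."""
    annotations = []
    for i, tok in enumerate(tokens):
        if rel_probs[i] < tau_prob:          # token not HPO-relevant
            continue
        c = int(np.argmax(concept_probs[i]))
        if concept_probs[i][c] >= tau_prob and freq_concepts[c] != "OTHER":
            annotations.append((tok, freq_concepts[c]))
        else:                                # fall back to rare concepts
            dists = np.linalg.norm(rare_embs - token_embs[i], axis=1)
            j = int(np.argmin(dists))
            if dists[j] <= tau_dist:
                annotations.append((tok, rare_concepts[j]))
    return annotations
```

In the frequent branch the softmax classifier decides directly; in the rare branch the nearest HPO embedding wins only if it is close enough.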

Data Augmentation Techniques

In the self-supervised pre-training stage, there are two issues related to creating the training data by keyword matching: (1) these data may be too limited to help the model capture contextual synonyms of phenotypes; (2) rare HPO concepts may not be found in the clinical text used for training, so the model would be unable to detect them at inference. We address these two problems with data augmentation, which creates textual variants of existing HPO-relevant spans by paraphrasing and uses a generative model to generate context around rare HPO concepts.

Implementation Details

The data processing and models are developed in Python 3.7. Besides our own code, we use open-source third-party libraries including Matplotlib, NumPy, Pandas, Pronto, Scikit-learn, Transformers, Tensorboard, PyTorch, Tqdm, Xmltodict and their dependencies.


In the self-supervised setting, we report the performance of the following baselines: (1) Keyword, a naive method that simply matches HPO names, synonyms and abbreviations to text spans; (2) a variety of text mining models (Clinphen [deisseroth2019clinphen], NCBO [Tchechmedjiev2018], cTAKES [Savova2010], MetaMap [aronson2006metamap], MetaMapLite [demner2017metamaplite]); and (3) two deep learning models (NCR [arbabi2019ncr], MedCAT [kraljevic2019medcat]) which are trained without supervision.

In the fine-tuned setting, we use as baselines (1) BERT-Base [devlin-etal-2019-bert]; (2) BioBERT-Base v1.0 [biobert], pre-trained on PubMed and PMC; (3) ClinicalBERT [alsentzer-etal-2019-publicly], pre-trained from BioBERT on MIMIC-III discharge summaries; and (4) SciBERT [beltagy-etal-2019-scibert], pre-trained on scientific literature.

We use micro-averaged Precision, Recall and F1-score at the document level, with HPO annotations de-duplicated. Following the best practices of [Liu2019], we report metrics under two regimes: (1) Exact match, where only exactly matching HPO annotations are counted as correct; and (2) Generalised match, where both the predicted and target HPO annotations are first extended to include all their ancestors in HPO up to Phenotypic Abnormality (HP:0000118), exclusive.
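A toy sketch of the two matching regimes (the ancestor edges shown are illustrative placeholders, not the real HPO graph; in practice the full ancestor closure comes from the ontology):

```python
# Illustrative ancestor closure for two HPO ids; the root
# Phenotypic abnormality (HP:0000118) is excluded by convention.
ANCESTORS = {
    "HP:0012735": {"HP:0002098"},  # assumed child -> ancestor edge
    "HP:0002098": set(),
}

def generalise(concepts):
    """Extend annotations with all ancestors up to (excluding) the root."""
    out = set(concepts)
    for c in concepts:
        out |= ANCESTORS.get(c, set())
    return out

def prf(pred, gold):
    """Document-level precision/recall/F1 over de-duplicated annotations."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

A prediction that picks a child of the gold concept scores zero under exact match but partial credit under generalised match.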

Our primary observation is that our phenotyping model outperforms all the baselines with respect to F1 and recall scores in both settings for both the exact and generalised matches.

Method Exact Match Generalised Match
Precision Recall F1 Precision Recall F1
Keyword 0.7496 0.5223 0.6156 0.7883 0.6587 0.7177
NCR 0.7747 0.4851 0.5967 0.8778 0.5917 0.7069
Clinphen 0.8147 0.3148 0.4541 0.9337 0.4066 0.5665
NCBO 0.9186 0.3901 0.5477 0.9632 0.4711 0.6328
cTAKES 0.8250 0.3259 0.4673 0.9321 0.4287 0.5872
MetaMap 0.7909 0.4062 0.5367 0.8835 0.5384 0.6691
MetaMapLite 0.7968 0.4358 0.5634 0.8766 0.5720 0.6923
MedCAT (Medmentions) 0.7290 0.3321 0.4563 0.8305 0.4711 0.6012
MedCAT (UMLS) 0.8630 0.3889 0.5362 0.9311 0.5231 0.6699
Ours (E) 0.7334 0.5619 0.6363 0.7706 0.6972 0.7320
Ours (E) w. Augmented Data 0.7235 0.5556 0.6285 0.7741 0.6997 0.7351
Self-supervised setting.
Method Exact Match Generalised Match
Precision Recall F1 Precision Recall F1
BERT 0.7132 0.5617 0.6285 0.8434 0.6844 0.7557
BioBERT 0.6864 0.5728 0.6245 0.8335 0.7021 0.7622
ClinicalBERT 0.6962 0.5630 0.6225 0.8429 0.6980 0.7637
SciBERT 0.6898 0.5407 0.6062 0.8269 0.6671 0.7385
Ours (E) w. Augmented Data 0.7141 0.7123 0.7132 0.8463 0.8380 0.8421
Fine-tuned setting.
Table A1: Results of phenotype annotation on the gold-standard test set in the self-supervised and fine-tuned settings, respectively. Our phenotyping model overall achieves the best F1 scores.

Shapley Values

Shapley values come from game theory and are used to estimate the impact of a feature on a system’s output. Feature impact is defined as the variation in the output of the model when the feature is observed versus when it is unknown.

Shapley values belong to a category of methods denominated additive. In particular, the additivity is formulated as

$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i$,

where $f(x)$ is the prediction made by the model, $x$ are the features fed to the model, $M$ is the number of features, $\phi_i$ is the Shapley value of the $i$-th feature, and $\phi_0$ is the expected value of the model over the training dataset. This assumption also ensures the values correctly reflect the difference between the expected model output and the output for a particular prediction.

The Shapley value of the $i$-th feature is computed via

$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\left[f(S \cup \{i\}) - f(S)\right]$,

where $S$ is a subset of the full set of input features $F$, and $f(S)$ is the model evaluated on a subset of the input features with only those belonging to $S$ present.

In this study we used the SHAP library [Lundberg2017original] and its optimisation for tree-based classifiers [Lundberg2020tree] to compute the Shapley values.
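For intuition, the definition above can be evaluated exactly for a toy model by enumerating feature subsets, replacing unobserved features with a baseline value (a simplification of the conditional expectation that SHAP approximates; feasible only for a handful of features):

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley values by subset enumeration. Features outside the
    subset S are set to a baseline as a stand-in for marginalisation."""
    n = len(x)
    phi = [0.0] * n

    def f(subset):
        z = [x[j] if j in subset else baseline[j] for j in range(n)]
        return model(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for S in combinations(others, k):
                # weight |S|! (n - |S| - 1)! / n!
                w = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[i] += w * (f(set(S) | {i}) - f(set(S)))
    return phi
```

For a linear model the values recover each feature’s coefficient times its deviation from the baseline, and they sum to the difference between the prediction and the baseline output, as the additivity property requires.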

Algorithm Hyperparameters

Classifier Hyperparameters
Random Forest num of estimators=300, criterion="gini", max depth=None, min samples split=2, min samples leaf=1
Gradient Boosting learning rate=0.3, min split loss=0, max depth=6, min child weight=1, subsample=1
LSTM epochs=30, hidden size=128, batch size=8, num of layers=1, patience=10, dropout rate=0, learning rate=1e-4, weight decay=0.0
Table A2: Hyperparameters for classifiers

Feature importance for Length-of-Stay

Figure A2: Top features for length-of-stay predicting stays of more than 1 week.

Significance tests

Secondary Models: S | S + NCR | S + CB | S + Ours
S - 81.4 69 13.3
S + NCR 18.6 - 32.4 0.6
S + CB 31 67.6 - 2.8
S + Ours 86.7 99.4 97.2 -
S - 9.5 0 0
S + NCR 90.5 - 6.1 0
S + CB 100 93.9 - 0.1
S + Ours 100 100 99.9 -
LSTM S - 0 0 0
S + NCR 100 - 73 0
S + CB 100 27 - 0
S + Ours 100 100 100 -
Physiological Decompensation
Secondary Models: S | S + NCR | S + CB | S + Ours
S - 22.1 100 0
S + NCR 77.9 - 100 0
S + CB 0 0 - 0
S + Ours 100 100 100 -
S - 100 100 0
S + NCR 0 - 0 0
S + CB 0 100 - 0
S + Ours 100 100 100 -
LSTM S - 0 0 0
S + NCR 100 - 100 0
S + CB 100 0 - 0
S + Ours 100 100 100 -
Length of Stay
Secondary Models: S | S + NCR | S + CB | S + Ours
S - 1 13.25 0.13
S + NCR 99 - 94.82 26.13
S + CB 86.75 5.18 - 1.07
S + Ours 99.87 73.87 98.93 -
S - 0.6 1.78 2.26
S + NCR 99.4 - 88.39 84.43
S + CB 98.22 11.61 - 44.07
S + Ours 97.74 15.57 55.93 -
LSTM S - 0 0 0
S + NCR 100 - 100 0
S + CB 100 0 - 0
S + Ours 100 100 100 -
In-Hospital Mortality
Table A3: Statistical significance matrix with bootstrap resampling. All scores are the percentage of samples in which the base model (row) performs better than the secondary model (column). Each sample is built by resampling the original test set and then scoring the base/secondary model on it. For example, the last row in (c) shows that the base model (LSTM with S + Ours) is better than the secondary models (LSTM with S, S + NCR or S + CB) on 100% of samples (i.e., with statistical significance). Here, S refers to Structured, NCR to the Neural Concept Recognizer [arbabi2019ncr], CB to ClinicalBERT, and Ours to our phenotyping model.

Calibration curves

(a) Physiological Decompensation
(b) In-hospital Mortality
Figure A3: Calibration curves with Gradient Boosting for (a) physiological decompensation and (b) in-hospital mortality.
(a) Physiological Decompensation
(b) In-hospital Mortality
Figure A4: Calibration curves with Random Forest for (a) physiological decompensation and (b) in-hospital mortality.