ISeeU2: Visually Interpretable ICU mortality prediction using deep learning and free-text medical notes

05/19/2020 · William Caicedo-Torres et al. · Auckland University of Technology

Accurate mortality prediction allows Intensive Care Units (ICUs) to adequately benchmark clinical practice and identify patients with unexpected outcomes. Traditionally, simple statistical models have been used to assess patient death risk, many times with sub-optimal performance. On the other hand, deep learning holds promise to positively impact clinical practice by leveraging medical data to assist diagnosis and prediction, including mortality prediction. However, as the question of whether powerful deep learning models attend to correlations backed by sound medical knowledge when generating predictions remains open, additional interpretability tools are needed to foster trust and encourage the use of AI by clinicians. In this work we show a deep learning model trained on MIMIC-III to predict mortality using raw nursing notes, together with visual explanations for word importance. Our model reaches a ROC AUC of 0.8629 (+/-0.0058), outperforming the traditional SAPS-II score and providing enhanced interpretability when compared with similar deep learning approaches.




1 Introduction

Intensive Care Units (ICUs) are the last line of defense against critical conditions that require constant monitoring and advanced medical support. Their importance has been highlighted in recent times, when ICUs around the world have been overrun by the COVID-19 pandemic Grasselli et al. (2020); Emanuel et al. (2020). It is in times like these when research into ways to adequately manage scarce critical care resources must be even more vigorously pursued, in order to offer additional tools that support medical decisions and allow for the effective benchmark of clinical practice.

The issue of mortality prediction in the ICU has been approached from a statistical standpoint by means of risk prediction models like APACHE, SAPS, and MODS, among others Rapsang and Shyam (2014). These models use a set of physiological predictors, demographic factors, and the occurrence of certain chronic conditions to estimate a score that serves as a proxy for the likelihood of death of ICU patients. Because of the relatively straightforward way of interpreting results, simple statistical approaches such as logistic regression are the go-to modeling techniques used to estimate mortality probability and the importance of the predictors involved. On the other hand, the simplicity of these models also means that their limited expressiveness may not accurately represent the possibly non-linear dynamics of mortality prediction. Given this, high-capacity machine learning models might be useful to increase predictive performance. Concretely, the relevant literature shows that the use of deep learning models trained on physiological time-series data can outperform these previously mentioned statistical models Purushotham et al. (2018); Caicedo-Torres and Gutierrez (2019).

One of the advantages of deep learning over other techniques is its ability to use multiple modes of data to train predictive models. In the biomedical domain, health records, images, and time-series data have been used for different tasks with success Shen et al. (2017); Shickel et al. (2018). This advantage is relevant for mortality prediction (and for many other clinical tasks as well), as a substantial amount of data is generated inside ICUs as free-text notes, which can be used as input to create Natural Language Processing (NLP) predictive models. The nature of NLP poses some challenges for which deep learning is uniquely suited, via its ability to deal with high-dimensional data and its elegant way of taking temporal and spatial patterns into account. Some works have used deep neural networks and free text to predict mortality Grnarova et al. (2016) and length of stay (among others), showing that there is interesting potential for this type of model.

On the other hand, a particularly important downside of deep learning is that, compared to the simpler logistic regression based models, feature importance is not as readily available. This in turn makes these models hard to interpret, as internally the model may transform the original input features into high-dimensional spaces via non-linear transformations, making it hard to establish the impact of each predictor on the predicted outcome. It has been documented that, given their large predictive capacity, deep learning models can easily fit spurious correlations in the datasets used for their training, leading to potential diagnostic issues Cooper et al. (1997). However, some work has been done to interpret deep learning models in order to offer explanations intended to foster trust and further encourage their usage in the critical care setting. For instance, in our previous work we developed an interpretable deep learning mortality prediction model that uses physiological time-series data from the first 48 hours of patient ICU stay Caicedo-Torres and Gutierrez (2019).

In this work, we present ISeeU2, a deep learning model that uses free-text medical notes from the first 48 hours of stay to predict patient mortality in the ICU. We use the MIMIC-III database Johnson et al. (2016) to train a convolutional neural network (ConvNet) that is able to use raw nursing notes with minimal pre-processing to efficiently generate a prediction, and we couple the prediction of mortality with word importance and sentence importance visualizations, in a way that annotates the original medical note to show which parts of it are more predictive of death or survival, according to the model.

1.1 Related work

In the past, some works have used deep learning to predict ICU mortality using free text. Grnarova et al. (2016) proposed the use of a convolutional neural network for ICU mortality prediction using free-text medical notes from MIMIC-III. They used all medical notes from each patient stay to predict mortality, and trained their model using a custom loss function that included a cross-entropy term involving mortality prediction at the sentence level as well, with promising results. Jo et al. (2017) used a hybrid Latent Dirichlet Allocation (LDA) + Long Short-Term Memory (LSTM) model for ICU mortality prediction trained on medical notes from MIMIC-III, in which the LSTM used the LDA topic features as input. Sushil et al. (2018) used stacked denoising autoencoders to create patient representations out of free-text medical notes, to be used for downstream tasks such as mortality prediction. Si and Roberts (2019) proposed the use of a ConvNet for multitask prediction (mortality, length of stay), using all available patient medical notes up until time of discharge. Jin et al. (2018) proposed a multimodal neural network architecture and a Named Entity Recognition (NER) text pre-processing pipeline to predict in-hospital ICU mortality using all available types of free-text notes and a set of vital signs and lab results from the first 48 hours of patient stay, extracted from MIMIC-III.

Most of these works include some ad-hoc interpretability mechanism: Grnarova et al. (2016) included a sentence-based mortality prediction target which is then used to score individual words according to their associated predicted mortality probability, Jo et al. (2017) used LDA-computed weights to provide word importance, and Sushil et al. (2018) used a gradient-based interpretability approach to compute the importance of words in the input notes.

Our work has key differences relative to the related literature. As opposed to Grnarova et al. (2016) and Si and Roberts (2019), we only use notes from the first 48 hours of patient stay instead of all notes available up until the time of discharge/death, and as opposed to Jin et al. (2018) we only use nursing notes and not the whole spectrum of notes available in MIMIC-III. Also, from an interpretability standpoint, we rely on a theoretically sound concept from coalitional game theory, known as the Shapley Value Shapley (1953), instead of explainability heuristics. Finally, our visualization approach puts emphasis on presenting results in a way that can be easily understood and is useful for users.

The contributions of our work are summarized in the following:

  • We present a model that is able to offer performance comparable to state of the art models that use physiological time series data, but only using raw nursing notes extracted from MIMIC-III.

  • Our approach only uses data from the first 48 hours of patient stay, instead of using data from the entirety of the stay. That makes our model more usable in a real setting as a benchmark tool.

  • Our approach to interpretability is based on a theoretically sound concept (the Shapley Value) and our visualizations provide a novel way to annotate clinical free-text notes to highlight the most informative parts for the prediction of mortality.

This paper is organized as follows: first we will show the overall distribution of our patient cohort dataset and its corresponding distribution of medical free-text notes. Then we will briefly describe our approach to interpretability using the Shapley Value, followed by a description of our convolutional architecture and experimental setting. Finally we will present and discuss our results and end with our conclusions and suggested future work.

2 Methods and materials

2.1 Participants

We used the Medical Information Mart for Intensive Care (MIMIC-III v1.4) to create a dataset for the training of our deep learning model. MIMIC-III contains ICU records including vitals, laboratory, therapeutic and radiology reports, representing more than a decade of data from patients admitted to the ICUs of the Beth Israel Deaconess Center in Boston, Massachusetts Johnson et al. (2016). The median age of adult patients (those aged 16 years or older) is 65.8 years, and the median length of stay (LoS) for ICU patients is 2.1 days (Q1-Q3: 1.2-4.6) Johnson et al. (2016).

Our patient cohort was created using the following criteria: only stays longer than 48 hours were considered; in cases where patients were admitted multiple times to the ICU, only the first admission was considered; and patients had to have at least one free-text note recorded during their ICU stay. These criteria produced our initial patient sample. Table 1 shows the different types of medical notes included in our dataset together with their respective counts.

Note Type Count Percentage
Nursing/other 83147 36.78%
Radiology 61096 27.02%
Nursing 43790 19.37%
Physician 27789 12.30%
Respiratory 5728 2.53%
General 1775 0.78%
Nutrition 1549 0.68%
Rehab Services 521 0.23%
Social Work 501 0.22%
Case Management 134 0.060%
Consult 40 0.018%
Pharmacy 12 0.0053%
Overall 226082 100%
Table 1: Distribution of free-text medical notes in our dataset.

Given that a substantial number of patients in our dataset were missing more than one type of medical note, and that the nursing and nursing/other types were the most prevalent ones, we decided to only include patients that had some type of nursing note available (nursing, nursing/other), with no regard to the note word count. This reduced our patient sample to 16970 patients, with 1659 recorded deaths (9.78%) and 15311 patients that survived (90.22%). The mean note length is 1252.59 words, with a standard deviation of 1087.48. Table 2 and figure 1 show details about the distribution of note lengths. The median age of patients in our final sample is 67.2 years, and the median length of stay is 3.96 days (Q1-Q3: 2.8-7.16). Figures 2 and 3 show the distribution of age and length of stay in our final sample.

Statistic Positive class (survival) Negative class (death) Overall
Count 15311 1659 16970
Mean 1233.3 1430.6 1252.6
Std 1083 1112.4 1087.5
Min 34 144 34
Q1 711.0 890 724
Q2 934 1135 952
Q3 1286 1492 1310
Max 33771 9756 33771
Table 2: Length distribution of nursing notes for our patient sample.
Figure 1: Estimated nursing notes length distribution.
Figure 2: Histogram of age distribution by outcome. As a result of privacy-preserving measures, MIMIC-III shifts the ages of patients older than 89 years (i.e. such patients appear to be 300 years old).
Figure 3: Histogram of length of stay distribution by outcome.

2.2 Deep learning model

Our prediction model, called ISeeU2, is a convolutional neural network (ConvNet). ConvNets are a specialized neural network architecture that exploits the convolution operator and spatial pooling operations to detect local patterns and reduce input dimensionality, learning a representation that is useful for predictive purposes LeCun et al. (1998). ConvNets are extensively and primarily used in computer vision but have found application in Natural Language Processing as well, given their ability to deal with patterns occurring at different scales in sequential inputs Goodfellow et al. (2016); Grnarova et al. (2016).

The specific architecture of our model (figure 4) includes a text embedding layer to convert a bag-of-words text representation into 10-dimensional dense word vectors. The output of the embedding layer is then fed to a convolutional layer with 32 channels and a kernel size of (stride 1), followed by ReLU activations and a max-pooling layer with a pool size of (stride 1). The obtained representation is then fed to a dense layer with ReLU activations connected to a one-neuron final layer with sigmoid activation, which computes the mortality probability.
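The kernel size, pool size, and dense-layer width did not survive extraction from the original text, so the values used below (kernel 3, pool 2, 64 dense units) are placeholder assumptions. With that caveat, the architecture described above can be sketched in Keras roughly as follows:

```python
import tensorflow as tf


def build_model(vocab_size=100_000, max_len=500,
                channels=32, kernel=3, pool=2, dense_units=64):
    """Sketch of the ISeeU2-style ConvNet: embedding -> conv -> pool ->
    dense -> sigmoid mortality probability. Hyperparameters marked above
    as placeholders are NOT from the paper."""
    inputs = tf.keras.Input(shape=(max_len,))
    # 10-dimensional dense word vectors, as described in the text
    x = tf.keras.layers.Embedding(vocab_size, 10)(inputs)
    # 32-channel convolution with stride 1 and ReLU activation
    x = tf.keras.layers.Conv1D(channels, kernel, strides=1, activation="relu")(x)
    x = tf.keras.layers.MaxPooling1D(pool_size=pool, strides=1)(x)
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(dense_units, activation="relu")(x)
    # one-neuron sigmoid output: predicted mortality probability
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)
```

The functional-API form avoids hard-coding the input length into the embedding layer, which keeps the sketch portable across Keras versions.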

Figure 4: Deep learning model architecture.

One argument routinely used against deep learning is its reduced interpretability when compared to other modeling techniques such as logistic regression Cooper et al. (1997). In order to overcome that potential limitation we use the Shapley Value to find how inputs affect the output of the model, hence gaining insight about which kinds of note fragments are more correlated with negative outcomes. The Shapley Value is a concept from coalitional game theory that formalizes the contribution of individual players towards the attainment of a goal as part of a team. It captures the marginal importance of each player when its role is analyzed across all possible subsets of players from the original coalition. According to Shapley Shapley (1953), given a coalitional form game (N, v), with a finite set of players N of size n and a function v that describes the total worth of each coalition, the marginal importance of player i can be expressed as

φ_i(v) = Σ_{S ⊆ N∖{i}} [ |S|! (n − |S| − 1)! / n! ] (v(S ∪ {i}) − v(S)).   (1)

The summation is taken over all possible subsets S that don't include player i, and each of its terms captures the effect of player i on the reward attained by each subset, v(S ∪ {i}) − v(S).
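As an illustration of the definition above (ours, not part of the original pipeline), exact Shapley Values can be computed by brute force for small coalitions:

```python
from itertools import combinations
from math import factorial


def shapley_value(players, v, i):
    """Exact Shapley Value of player `i` under characteristic function `v`.

    `v` maps a frozenset of players to the coalition's worth. Cost is
    exponential in the number of players, which is why approximations are
    needed for high-dimensional model inputs.
    """
    others = [p for p in players if p != i]
    n = len(players)
    total = 0.0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            # weight |S|! (n - |S| - 1)! / n! from equation (1)
            weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
            # marginal contribution of i to coalition S
            total += weight * (v(s | {i}) - v(s))
    return total
```

For an additive game where the worth equals the coalition size, every player's Shapley Value is exactly 1, as expected.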

Strumbelj et al. (2010) have shown in their work that it is possible to apply the Shapley Value to the problem of feature importance quantification, if inputs are considered players in a coalition and the predicted value is akin to the attained reward. In this way the Shapley Value becomes very useful, as it takes into account the interaction between features in a way other methods, like tree-based feature importance or simple input occlusion, cannot.

A downside of using the Shapley Value for model interpretability is that equation 1 has combinatorial cost, which makes exact computation unfeasible for practical purposes. In order to get around this limitation we resort to a fast approximate algorithm, DeepLIFT Shrikumar et al. (2017), to compute approximate Shapley Values in feasible time Lundberg and Lee (2017). DeepLIFT is an algorithm specifically designed to compute feature importance in feed-forward neural networks. It overcomes the issues associated with competing methods such as Layerwise Relevance Propagation Shrikumar et al. (2017) and gradient-based attribution Simonyan et al. (2013); Springenberg et al. (2014), i.e. saturation, overlooked negative contributions, and gradient discontinuities Shrikumar et al. (2017). DeepLIFT computes feature importance by comparing the network output to a reference output obtained by feeding the network a designated reference input. The difference in outputs is back-propagated through the layers of the network until the input layer is reached and feature importances are fully computed. A more detailed treatment of DeepLIFT in the context of interpreting deep learning models for critical care prognosis can be found in Caicedo-Torres and Gutierrez (2019).
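To make the cost argument concrete, the permutation-sampling estimator in the spirit of Strumbelj et al. (2010), averaging a player's marginal contribution over random orderings, can be sketched in a few lines. This illustrates approximate Shapley Value estimation in general; it is not DeepLIFT itself, which instead back-propagates output differences through the network:

```python
import random


def shapley_sample(players, v, i, n_samples=2000, seed=0):
    """Monte Carlo Shapley estimate: average the marginal contribution of
    player `i` over random permutations of the player set."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        perm = players[:]
        rng.shuffle(perm)
        # coalition of players that "arrive" before i in this permutation
        before = frozenset(perm[:perm.index(i)])
        total += v(before | {i}) - v(before)
    return total / n_samples
```

The estimate converges to equation 1 as the number of sampled permutations grows, at a cost linear in the sample count instead of exponential in the number of players.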

3 Results

Our ConvNet was built using Tensorflow Abadi et al. (2015). Since our dataset is highly unbalanced (negative outcomes represent just 9.78% of training examples), we used a weighted logarithmic loss assigning more importance to the positive class, i.e. patients that died in the ICU. We used 5-fold cross-validation to assess the model performance and place a confidence estimate on it. We did not perform any substantial hyperparameter optimization other than conservatively varying the number of channels of the convolutional layer and the number of neurons of the first fully connected layer of the network. Our choice of optimizer was Adam Kingma and Ba (2014) with default Tensorflow-provided parameters. Our model was trained for three epochs per training fold, and we kept the lowest-loss model of each run.
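The exact class weight is not stated above; a common choice, used here only as an assumption for illustration, is the inverse class frequency (about 15311/1659 ≈ 9.2 for our cohort). A weighted logarithmic loss then looks like:

```python
from math import log


def weighted_log_loss(y_true, y_prob, pos_weight):
    """Class-weighted binary cross-entropy: positive (death) examples are
    up-weighted by `pos_weight` to counter the class imbalance."""
    eps = 1e-7  # clip probabilities away from 0/1 to keep log finite
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        w = pos_weight if y == 1 else 1.0
        total += -w * (y * log(p) + (1 - y) * log(1 - p))
    return total / len(y_true)
```

With `pos_weight = 1.0` this reduces to the ordinary logarithmic loss; raising it penalizes missed deaths more heavily.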

3.1 Text pre-processing

One of our goals is to show a deep learning model that needs little to no input pre-processing, so it can be as widely applicable as possible. In keeping with that, we used the NLTK library Loper and Bird (2002) to remove English stop-words and the Tensorflow.keras default tokenizer to vectorize the text notes, keeping the 100k most frequent words; no further pre-processing was attempted. The tokenizer was fitted only on the training folds to avoid data leakage. Finally, we set the maximum note length to 500 words, so notes with a larger word count were truncated at the beginning and those with a smaller word count were padded at the beginning with zeroes.

3.2 Model performance and comparison with baselines

Using this configuration we obtained a 5-fold cross-validation Receiver Operating Characteristic Area Under the Curve (ROC AUC) of 0.8629 (±0.0058), as seen in figure 5. Using a 0.5 decision threshold, the model reaches 72% sensitivity at 83% specificity. We also provide some baseline models to compare with our proposed model to better assess its performance. Concretely, we have included results for a traditionally used mortality risk score and a recurrent neural network.
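For reference, ROC AUC has a simple rank interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch (quadratic in sample size, fine for illustration):

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outranks a
    random negative; ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

This rank view also explains why ROC AUC is threshold-free, unlike the sensitivity/specificity figures quoted at the 0.5 cutoff.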

Figure 5: ConvNet 5-fold cross validated ROC AUC.


As a baseline, we used a well-established ICU mortality risk score, SAPS-II Gall et al. (1993). SAPS-II uses data from the first 24 hours of ICU stay to calculate a numerical score, which in turn is converted into a mortality probability. In order to compare our approach with SAPS-II predictions and performance, we trained our convolutional architecture using nursing notes from the first 24 hours only, while keeping training parameters the same. We used the SAPS-II implementation provided by the authors of the MIMIC-III code repository Johnson et al. (2018). The 24-hour version of our model obtained a 0.8155 ROC AUC 5-fold cross-validation score, against 0.7448 for the SAPS-II model. Figures 6 and 7 show the corresponding ROC plots for the two models.

Figure 6: SAPS-II model 5-fold cross validated ROC AUC.
Figure 7: ConvNet (24h) 5-fold cross validated ROC AUC.


Our second baseline is a recurrent neural network based on the Long Short-Term Memory (LSTM). The LSTM is a neural network model designed to handle sequential input data with temporal dependencies Hochreiter and Schmidhuber (1997), and it has been used extensively in Natural Language Processing tasks. We trained a deep neural network with a bidirectional LSTM layer with 100 units, followed by an extra 100-unit LSTM layer, a 50-unit dense layer with ReLU activation, and a final sigmoid layer. As was the case for our original convolutional model, an embedding layer was used to create 10-dimensional dense vectors to feed the initial layer of the LSTM, and the same text pre-processing pipeline was used (save for a now 1000-word maximum note length). Finally, dropout was applied to control overfitting. With this particular architecture we were able to obtain a 0.7839 ROC AUC 5-fold cross-validation score (Figure 8).

Figure 8: Deep LSTM 5-fold cross validated ROC AUC.

3.3 Model interpretability

Using the DeepLIFT implementation provided by Lundberg and Lee (2017) which works appropriately with Tensorflow 2 models, we calculated word importances for our model, using the empirical mean of the input embedding vectors as reference value. Using these values we designed and built visualizations to show the importance of each word in the original nursing note used as input. Our visualizations constitute a form of post hoc interpretability Lipton (2016) insofar as they try to convey how the model regards the inputs in terms of their impact on the predicted probability of death, without having to explain the internal mechanisms of our neural network, nor sacrificing predictive performance. We have selected some examples at random from both the training set and the validation set of the last cross-validation run to show the behavior of the model and the way it regards certain words in the input notes. Our proposed visualizations include word clouds and text heatmaps (Figures 9 and 10 ).

Figure 9: Top: Word clouds generated for one specific patient in the training set show the words deemed as most important for both survival (left) and death (right) prediction. Bottom: Text heatmap showing words, their importance and their context in sentences, generated for one specific patient in the training set. Red color denotes evidence for death, and blue color represents evidence for survival. Words with a gray background are not considered important for the prediction task by the network. Padding characters are represented by asterisks.

Word clouds are an interesting way to visualize words and their importance at the same time, but they don’t capture the context in which words live, potentially leading to erroneous interpretations. For example, the survival word cloud in Figure 9 shows melena as associated with survival, which is not readily understandable. However, when the word cloud is combined with the note heatmap, the reason becomes apparent, given the context of the word (stable present wo melena stool). We also observe that certain phrases and words are flagged intuitively, e.g. guaic pos heme, and also the fact that for this particular patient occurrences of Plavix/Clopidogrel in the note are flagged as evidence for survival. There are other instances in which results are not intuitive and may point to statistical flukes rather than strong causal features. As an example we can point to the phrases return baseline bp numerous large clots suctioned, and continue keep pt family aware, in which the words clot and pt seem to be flagged incorrectly.
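A heatmap of this kind can be rendered by mapping each word's approximate Shapley Value to a color. The palette and threshold below are illustrative assumptions, not the paper's exact rendering code: red for evidence of death (positive values), blue for survival (negative values), gray for negligible importance:

```python
def heatmap_html(words, shap_values, threshold=0.01):
    """Annotate a note as HTML spans colored by approximate Shapley Value.
    Color intensity scales with the value's magnitude relative to the
    largest one in the note."""
    peak = max(abs(v) for v in shap_values) or 1.0
    spans = []
    for w, v in zip(words, shap_values):
        if abs(v) < threshold:
            color = "#dddddd"  # unimportant word: gray background
        else:
            alpha = abs(v) / peak
            channel = format(int(255 * (1 - alpha)), "02x")
            color = (f"#ff{channel}{channel}" if v > 0   # death: red
                     else f"#{channel}{channel}ff")      # survival: blue
        spans.append(f'<span style="background:{color}">{w}</span>')
    return " ".join(spans)
```

The output can be dropped into any HTML page, which is convenient for reviewing annotated notes alongside the raw text.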

Annotation smoothing

In order to help ameliorate the sharp changes and inconsistencies observed at the sentence level, we used a convolution filter that takes into account the effect of the Shapley Values of all words in a particular sentence when generating the heatmap annotations, providing a smoother and more intuitive result. Note that this is an approximation, since we are intent on using a basic pre-processing pipeline, without any advanced capabilities (i.e. sentence segmentation). A convolution filter allows us to spread out the feature importance of individual words and to fade out weak importances that are possibly due to noise, while still keeping the most salient features. Figure 10 shows our previous training set note and a new validation set note with and without the convolution filter applied.
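A minimal sketch of such a smoothing step follows; the kernel weights are an assumption, as the exact filter is not specified above:

```python
def smooth(values, kernel=(0.25, 0.5, 0.25)):
    """Smooth per-word Shapley values with a small 1-D convolution,
    spreading each word's importance to its neighbors and fading
    isolated, possibly noisy spikes. Out-of-range neighbors are skipped."""
    half = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(values):
                acc += w * values[idx]
        out.append(acc)
    return out
```

An isolated spike of importance is spread over its neighbors, which is exactly the visual effect of the smoothed heatmaps.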

Figure 10: Text heatmaps with (right) and without (left) convolution smoothing. Bottom row corresponds to a nursing note from the validation set.

Note length and mortality probability

High-capacity machine learning models such as deep neural networks have the ability to leverage subtle correlations and patterns to attain very low training error in learning tasks. As shown in Table 2 and Figure 1, there is a difference in our sample between the mean note length of patients who survived and of those who had a negative outcome. A Mann-Whitney U test supplied further evidence, as we were able to reject the null hypothesis, i.e. that the distributions of nursing note lengths are the same, in favor of our alternative hypothesis, i.e. that notes for patients who do not survive are longer.

Having established that, we decided to investigate whether our model was somehow attending to that difference in distributions. For this purpose we inspected the importance scores of the padding characters used by our pre-processing pipeline, and found that most of them are regarded as evidence for survival, which is consistent with our original conjecture that the model considers shorter notes to be correlated with a survival outcome (shorter notes have more padding characters). Figure 11 shows the distribution of approximate Shapley Values for padding characters.
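The Mann-Whitney test used above can be sketched via its normal approximation (ignoring tie correction, which is adequate for illustration). A large positive z for death-class note lengths versus survival-class lengths supports the alternative hypothesis that the former are longer:

```python
from math import sqrt


def mann_whitney(a, b):
    """Mann-Whitney U statistic for sample `a` vs `b`, with a
    normal-approximation z score (no tie correction). U counts pairs
    (x, y) with x > y, ties counting as half."""
    n1, n2 = len(a), len(b)
    u = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    mu = n1 * n2 / 2                              # mean of U under H0
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # std of U under H0
    return u, (u - mu) / sigma
```

With the sample sizes in Table 2, even a modest shift between the two length distributions yields a very large z and hence a tiny p-value.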

Figure 11: Distribution of approximate Shapley Values for padding characters. The histogram shows that most padding characters are deemed as evidence for survival by our model.

4 Discussion

Our convolutional model shows competitive performance on the MIMIC-III dataset, with consistent results across validation folds, showing evidence of good generalization. Validation ROC AUC (95% CI [0.855689, 0.867888]) is competitive with published results in a comprehensive benchmark by Purushotham et al. (2018) (95% CI [0.873706, 0.882894] ROC AUC) and with our previous work Caicedo-Torres and Gutierrez (2019) (95% CI [0.870396, 0.876604] ROC AUC). Moreover, our model bypasses some of the most important difficulties associated with the usage of physiological time series, i.e. inconsistent sampling times and missing values. On the other hand, the 24-hour version of our model still manages to comfortably surpass the SAPS-II baseline.

Our results are not directly comparable to those published by Grnarova et al. (2016), given that we restricted our input window to the first 48 hours of patient stay instead of using all available notes up until the time of discharge. Results published by Jo et al. (2017) show their models performing under 0.84 ROC AUC for mortality prediction using MIMIC-III data (48-hour mark), which is well below our results here. On the other hand, the Vital + EntityEmb model reported by Jin et al. (2018) uses physiological data and a substantial text pre-processing pipeline that involves a second neural network for Named Entity Recognition. Table 3 shows reported performance results for relevant models, compared with the performance of our model.

Model | Type | ROC AUC
Physiological models
GRU-D Che et al. (2016) | Recurrent | 0.8527 ± 0.003
MMDL Purushotham et al. (2018) | Hybrid | 0.8783 ± 0.0037
ISeeU | ConvNet | 0.8735 ± 0.0025
NLP models
LSTM+E+T+D Jo et al. (2017) | Recurrent | 0.84
Vital + EntityEmb Jin et al. (2018) | Hybrid (text & physiological inputs) | 0.8734 ± 0.0019
ISeeU2 (our work) | ConvNet | 0.8629 ± 0.0058
Table 3: Our results and results reported by related works. ROC AUC results are mean ± standard deviation from a 5-fold cross-validation run, except LSTM+E+T+D and Vital + EntityEmb, which report a single result over the test set.

By using text notes as input we rely not on the raw physiological data but on healthcare workers' perceptions and judgement in the form of free-text notes, giving us access to higher-level concepts not present in said physiological data. On the other hand, MIMIC-III notes are very noisy, with frequent misspellings, typos and a lack of standardized naming (e.g. writing vancomycin vs. vanc), which makes them a sub-optimal learning substrate. However, we have been able to show that deep learning models can separate useful signal from such noise, even while keeping our pre-processing pipeline very basic. Another interesting aspect of using free-text notes is that deep models can leverage meta-data such as note length, as our evidence suggests. This phenomenon is comparable to observations made in Lipton et al. (2016); Razavian and Sontag (2015) regarding the ability of deep neural networks to exploit patterns of missingness in physiological patient data to attain better predictive performance: certain physiological measurements are taken more or less frequently according to the state of the patient, providing additional and useful metadata. This kind of flexibility and power is outside the reach of the more traditional statistical modeling techniques behind risk scores such as SAPS-II.

Our visualization approach allows users to easily locate the parts of the notes the deep learning model is attending to, which can then be compared against clinical expectations. In this way potential users can verify that the reasons behind the predictions are sound and align with medical knowledge, as opposed to being evidence of statistical artifacts leveraged by the model. It is interesting to note that our results suggest our model and text heatmap visualization could be used to annotate medical notes for, at some point, easier handling by ICU staff. Finally, it is worth noting that our usage of Shapley Values was instrumental in discovering how the network regarded the padding introduced in shorter nursing notes.

5 Limitations and future work

Limitations of our study include the fact that we do not have access to some pre-admission data, and that we are using a retrospective, single-center cohort. Also, given the moderate size of our dataset, we only report cross-validation results without a result on a proper held-out test set. An additional limitation is that high-quality nursing notes may not be available for a substantial number of patients in other critical care settings, which could hurt the performance of our model. Finally, the common misspellings and other noise present in the medical notes may affect the quality of the explanations, giving rise to counterintuitive results.

In future work we intend to investigate the usage of a more robust pre-processing pipeline, and assess whether there is any performance improvement attributable to its usage. Also we intend to evaluate how our approach fares in a situation where limited-quality notes are the only training data available. Finally we plan to explore the joint usage of physiological time series data and free-text medical notes to train a multi-modal deep learning model and compare its performance with our current approach.

6 Conclusion

In this paper we have presented ISeeU2, a convolutional neural network for the prediction of mortality using free-text nursing notes from MIMIC-III. We showed that our model is able to offer performance competitive with that of much more complex models with little text pre-processing, while at the same time providing visual explanations of feature importance based on coalitional game theory that allow users to gain insight on the reasons behind predicted outcomes. Our visualizations also provide a way to annotate free-text medical notes with markers to flag parts correlated with predictions of survival and death. We have also shown that nursing notes could be rich enough to capture the concepts needed for mortality prediction at a level of accuracy far higher than what is currently possible with traditional statistical techniques.

7 Acknowledgments

The authors thank Dr. Janet Liang, PhD (MBChB, FANZCA, FCICM), for her invaluable help and expertise. GPU access and computing services were provided by the Data Centre at the Service and Cloud Computing Research Lab, and hosting services were managed by Bumjun Kim, Senior Technician, ICT. The Data Centre is part of the School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology.



References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, J. Shlens, B. Steiner, I. Sutskever, P. Tucker, V. Vanhoucke, V. Vasudevan, O. Vinyals, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467.
  • W. Caicedo-Torres and J. Gutierrez (2019) ISeeU: Visually interpretable deep learning for mortality prediction inside the ICU. Journal of Biomedical Informatics 98, pp. 103269.
  • Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2016) Recurrent Neural Networks for Multivariate Time Series with Missing Values. CoRR abs/1606.0.
  • G. F. Cooper, C. F. Aliferis, R. Ambrosino, J. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, C. Glymour, G. Gordon, B. H. Hanusa, J. E. Janosky, C. Meek, T. Mitchell, T. Richardson, and P. Spirtes (1997) An evaluation of machine-learning methods for predicting pneumonia mortality. Artificial Intelligence in Medicine.
  • E. J. Emanuel, G. Persad, R. Upshur, B. Thome, M. Parker, A. Glickman, C. Zhang, C. Boyle, M. Smith, and J. P. Phillips (2020) Fair Allocation of Scarce Medical Resources in the Time of Covid-19. New England Journal of Medicine.
  • J. R. Le Gall, S. Lemeshow, and F. Saulnier (1993) A New Simplified Acute Physiology Score (SAPS II) Based on a European/North American Multicenter Study. JAMA: The Journal of the American Medical Association.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep Learning. MIT Press.
  • G. Grasselli, A. Pesenti, and M. Cecconi (2020) Critical Care Utilization for the COVID-19 Outbreak in Lombardy, Italy. JAMA.
  • P. Grnarova, F. Schmidt, S. L. Hyland, and C. Eickhoff (2016) Neural Document Embeddings for Intensive Care Patient Mortality Prediction. CoRR abs/1612.0.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780.
  • M. Jin, M. T. Bahadori, A. Colak, P. Bhatia, B. Celikkaya, R. Bhakta, S. Senthivel, M. Khalilia, D. Navarro, B. Zhang, T. Doman, A. Ravi, M. Liger, and T. Kass-Hout (2018) Improving Hospital Mortality Prediction with Medical Named Entities and Multimodal Learning. arXiv:1811.12276.
  • Y. Jo, L. Lee, and S. Palaskar (2017) Combining LSTM and Latent Topic Modeling for Mortality Prediction. ArXiv abs/1709.0.
  • A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035.
  • A. E. W. Johnson, D. J. Stone, L. A. Celi, and T. J. Pollard (2018) The MIMIC Code Repository: Enabling reproducibility in critical care research. Journal of the American Medical Informatics Association.
  • D. P. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv:1412.6980.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-Based Learning Applied to Document Recognition. In Proceedings of the IEEE, Vol. 86, pp. 2278–2324.
  • Z. C. Lipton, D. Kale, and R. Wetzel (2016) Directly Modeling Missing Data in Sequences with RNNs: Improved Classification of Clinical Time Series. In Proceedings of the 1st Machine Learning for Healthcare Conference, Proceedings of Machine Learning Research, Vol. 56, pp. 253–270.
  • Z. C. Lipton (2016) The Mythos of Model Interpretability. ICML Workshop on Human Interpretability in Machine Learning, pp. 96–100.
  • E. Loper and S. Bird (2002) NLTK: The Natural Language Toolkit.
  • S. M. Lundberg and S. I. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems. arXiv:1705.07874.
  • S. Purushotham, C. Meng, Z. Che, and Y. Liu (2018) Benchmarking Deep Learning Models on Large Healthcare Datasets. Journal of Biomedical Informatics.
  • A. G. Rapsang and D. C. Shyam (2014) Scoring systems in the intensive care unit: A compendium. Indian Journal of Critical Care Medicine 18 (4), pp. 220–228.
  • N. Razavian and D. Sontag (2015) Temporal Convolutional Neural Networks for Diagnosis from Lab Tests. arXiv:1511.07938.
  • L. S. Shapley (1953) A Value for n-Person Games. In Contributions to the Theory of Games II, H. W. Kuhn and A. W. Tucker (Eds.), pp. 307–317.
  • D. Shen, G. Wu, and H. Suk (2017) Deep Learning in Medical Image Analysis. Annual Review of Biomedical Engineering 19 (1).
  • B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi (2018) Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics 22 (5), pp. 1589–1604.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning Important Features Through Propagating Activation Differences. arXiv:1704.02685.
  • Y. Si and K. Roberts (2019) Deep Patient Representation of Clinical Notes via Multi-Task Learning for Mortality Prediction. AMIA Joint Summits on Translational Science Proceedings.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR abs/1312.6.
  • J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller (2014) Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806.
  • E. Štrumbelj and I. Kononenko (2010) An Efficient Explanation of Individual Classifications using Game Theory. Journal of Machine Learning Research.
  • M. Sushil, S. Šuster, K. Luyckx, and W. Daelemans (2018) Patient representation learning and interpretable evaluation using clinical notes. Journal of Biomedical Informatics. arXiv:1807.01395.