Neural Document Embeddings for Intensive Care Patient Mortality Prediction

12/01/2016 ∙ by Paulina Grnarova, et al. ∙ 0

We present an automatic mortality prediction scheme based on the unstructured textual content of clinical notes. We propose a convolutional document embedding approach, and our empirical investigation using the MIMIC-III intensive care database shows significant performance gains compared to previously employed methods such as latent topic distributions or generic doc2vec embeddings. These improvements are especially pronounced for the difficult problem of post-discharge mortality prediction.



1 Introduction

The steadily growing amount of digitized clinical data such as health records, scholarly medical literature, systematic reviews of substances and procedures, or descriptions of clinical trials holds significant potential for exploitation by automatic inference and data mining techniques. Besides the wide range of clinical research questions such as drug-to-drug interactions wienkers2005predicting or quantitative population studies of disease properties wren2005data , there is a rich potential for applying data-driven methods in daily clinical practice for key tasks such as decision support kawamoto2005improving or patient mortality prediction moreno2005saps . The latter task is especially important in clinical practice when prioritizing allocation of scarce resources or determining the frequency and intensity of post-discharge care.

There has been an active line of work towards establishing probabilistic estimators of patient mortality both in the clinical institution as well as after discharge pirracchio2015mortality ; che2016recurrent ; johnson2016machine . The authors report solid performance on both publicly available and proprietary clinical datasets.

In spite of these encouraging findings, we note that most competitive approaches rely on time series and demographic information while algorithmic processing of the unstructured textual portion of clinical notes remains an important, yet, to date, insufficiently studied problem. The few existing advances towards tapping into this rich source of information rely on term-wise representations such as tf-idf embeddings ghassemi2014unfolding or distributions across latent topic spaces lehman2012risk .

This intuitively appears sub-optimal since several studies have independently highlighted the importance of accounting for phrase compositionality manifested, e.g., in the form of negations kuhn2016implicit , or long-range dependencies in clinical resources. Models that solely rely on point estimates of term semantics cannot be assumed to adequately capture such interactions.

In this paper, we aim to address these shortcomings by presenting a convolutional neural network architecture that explicitly represents not just individual terms but also entire phrases or documents in a way that preserves such subtleties of natural language.

The remainder of this paper is structured as follows: Section 2 introduces our model and our objective function. Subsequently, in Section 3, we empirically evaluate the model against two competitive baselines on the task of intensive care unit (ICU) mortality prediction on the popular MIMIC-III database johnson2016mimic . Finally, Section 4 concludes with a brief discussion of our findings.

2 Model

While simple feed-forward architectures, such as the doc2vec scheme le2014distributed , have been established as versatile plug-in modules in many machine learning applications lee2016sentiment ; lau2016empirical , they are inherently incapable of directly recognizing complex multi-word or multi-sentence patterns. However, constructions such as "no sign of pneumothorax" are frequently encountered in clinical notes and encode crucial information for the task of mortality prediction.

Following recent work in document classification yang2016 and dialogue systems serbanSBCP15 , we adopt a two-layer architecture. Let $D$ denote a patient’s record comprising $n$ sentences $s_1, \dots, s_n$. Our first layer independently maps the sentences $s_i$ to sentence vectors $x_i$. The second layer combines $x_1, \dots, x_n$ into a single patient representation $x_D$. For both levels we use convolutional neural networks (CNNs) with max-pooling, which have shown excellent results on binary text classification tasks kim2014 ; severyn2015 . Following work by Severyn et al. severyn2015 , we use word embeddings to provide vector input for the first CNN layer. Finally, the output of our model is $\hat{y}$, the estimated mortality probability, and our objective is the cross entropy

$$L(\hat{y}, y) = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right],$$

where $y \in \{0, 1\}$ is the ground-truth label. The graph rendered in black in Figure 1 depicts this basic architecture.
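The two-level design described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under several assumptions that are not taken from the paper: a single filter width per level, ReLU activations, zero-padding of short inputs, a sigmoid output in place of a two-class softmax, random untrained weights, and hypothetical names such as `conv_max_pool` and `predict_mortality`.

```python
import math
import random


def conv_max_pool(rows, filters, width):
    """Slide each filter over windows of `width` consecutive rows, then max-pool over positions."""
    dim = len(rows[0])
    # Assumption: zero-pad inputs shorter than the filter width so one window always exists.
    padded = rows + [[0.0] * dim] * max(0, width - len(rows))
    pooled = []
    for weights in filters:  # one flat weight list of length width * dim per filter
        best = 0.0  # ReLU activations are never negative
        for start in range(len(padded) - width + 1):
            window = [v for row in padded[start:start + width] for v in row]
            activation = max(0.0, sum(w * v for w, v in zip(weights, window)))
            best = max(best, activation)
        pooled.append(best)
    return pooled


def random_filters(count, width, dim, rng):
    """Hypothetical initialiser: small uniform random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(width * dim)] for _ in range(count)]


def predict_mortality(record, params):
    """record: list of sentences; each sentence is a list of word vectors."""
    # Level 1: every sentence is mapped independently to a sentence vector.
    sentence_vecs = [conv_max_pool(s, params["word_filters"], params["width"]) for s in record]
    # Level 2: the sequence of sentence vectors is mapped to one patient vector.
    patient_vec = conv_max_pool(sentence_vecs, params["sent_filters"], params["width"])
    logit = sum(w * v for w, v in zip(params["out_w"], patient_vec)) + params["out_b"]
    return 1.0 / (1.0 + math.exp(-logit)), sentence_vecs, patient_vec


def cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross entropy between a predicted probability and a 0/1 label."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))


rng = random.Random(0)
word_dim, n_filters, width = 4, 3, 3  # toy sizes, far smaller than in the experiments
params = {
    "width": width,
    "word_filters": random_filters(n_filters, width, word_dim, rng),
    "sent_filters": random_filters(n_filters, width, n_filters, rng),
    "out_w": [rng.uniform(-0.1, 0.1) for _ in range(n_filters)],
    "out_b": 0.0,
}
# A toy record with three sentences of 5, 2 and 7 random word vectors.
record = [[[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(n_words)]
          for n_words in (5, 2, 7)]
y_hat, sentence_vecs, patient_vec = predict_mortality(record, params)
loss = cross_entropy(y_hat, 1)
```

Because both levels reuse the same convolution-and-pooling routine, the patient representation has a fixed size regardless of how many sentences the record contains.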







Figure 1: Model architecture: In black, our basic architecture. In red, target replication. In violet, optional note information introduced in the next section. The CNN layers are depicted by double arrows. For clarity, we omit the word-vectors that serve as input to the initial CNN.

Target replication

The performance of the basic model presented above is promising but not yet satisfying. For similar long-sequence prediction problems, lipton15replication and dai2015 have noted that it is beneficial to replicate the loss at intermediate steps. Following their approach, we compute an individual softmax mortality probability for every sentence and incorporate additional cross entropy terms into our final objective. For a corpus containing $N$ patients, where patient $j$ has $n_j$ sentences, document-level prediction $\hat{y}^{(j)}$, sentence-level predictions $\hat{y}^{(j)}_i$ and label $y^{(j)}$, we seek to minimize:

$$\sum_{j=1}^{N} \left[ L\big(\hat{y}^{(j)}, y^{(j)}\big) + \lambda \, \frac{1}{n_j} \sum_{i=1}^{n_j} L\big(\hat{y}^{(j)}_i, y^{(j)}\big) \right]$$

Here $L$ denotes the cross entropy between a predicted probability and a label. The inner sum, scaled by $1/n_j$, can be interpreted as the average prediction error at the sentence level, effectively bringing the classification loss closer to the word level and regularizing the first CNN to learn sentence representations tailored to the mortality prediction problem. The hyper-parameter $\lambda$ determines the strength of the regularizer.
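The objective described above, a document-level cross entropy plus a weighted average of sentence-level cross entropies, can be sketched as follows. The function names, the weight argument `lam` for the regularizer strength, and the plain sum over patients (without further normalisation) are assumptions made for illustration.

```python
import math


def cross_entropy(y_hat, y, eps=1e-12):
    """Binary cross entropy between a predicted probability and a 0/1 label."""
    return -(y * math.log(y_hat + eps) + (1 - y) * math.log(1 - y_hat + eps))


def replicated_loss(doc_pred, sentence_preds, label, lam):
    """Loss for one patient: document term plus lam times the mean sentence term."""
    doc_term = cross_entropy(doc_pred, label)
    # Every sentence is scored against the patient's label (target replication).
    sent_term = sum(cross_entropy(p, label) for p in sentence_preds) / len(sentence_preds)
    return doc_term + lam * sent_term


def corpus_loss(patients, lam):
    """patients: iterable of (doc_pred, sentence_preds, label) triples."""
    return sum(replicated_loss(d, s, y, lam) for d, s, y in patients)


# Toy usage with made-up probabilities.
toy = [(0.5, [0.5, 0.5], 1), (0.5, [0.5, 0.5, 0.5], 0)]
without_replication = corpus_loss(toy, lam=0.0)
with_replication = corpus_loss(toy, lam=1.0)
```

Setting `lam` to zero recovers the basic objective, so the same routine covers both the model with and the model without target replication.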

Incorporating note information

End-to-end neural network architectures such as ours allow for easy incorporation of additional information that can increase predictive power. Every note in our collection has an associated category such as nursing, physician or social work. Providing this information to our classifier can help to reliably assess the importance of individual sentences for the classification task. To exploit this information, we embed all 14 categories into a vector space and concatenate every sentence vector with its associated category vector.
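A minimal sketch of this concatenation step is given below. The embedding dimension, the random initialisation and the helper names are hypothetical; in the actual model the category vectors would be learned jointly with the rest of the network, and only three of the 14 categories are named in the text.

```python
import random

# Three of the 14 note categories mentioned in the text; the full list is not given here.
CATEGORIES = ["nursing", "physician", "social work"]


def build_category_table(categories, dim, seed=0):
    """Hypothetical lookup table with one small random vector per category."""
    rng = random.Random(seed)
    return {c: [rng.gauss(0.0, 0.1) for _ in range(dim)] for c in categories}


def attach_category(sentence_vec, category, table):
    """Concatenate a sentence vector with the vector of its note's category."""
    return sentence_vec + table[category]


table = build_category_table(CATEGORIES, dim=2)
sentence_vec = [0.3, 0.0, 1.2]
augmented = attach_category(sentence_vec, "nursing", table)
```

The sentence-level CNN then operates on the augmented vectors, so its filters can weigh a sentence differently depending on the kind of note it came from.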

3 Experiments

We evaluate the proposed method on three standardized ICU mortality prediction tasks. On the basis of a patient’s electronic health record, we predict whether the patient will die (1) during the hospital stay, (2) within 30 days after discharge, or (3) within 1 year after discharge, and report the area under the ROC curve (AUC) as the evaluation measure.
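For reference, AUC can be computed directly from its pairwise definition: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that definition; the function name `roc_auc` is an assumption, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy usage: label 1 means the patient died within the task's time window.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs are ordered correctly
```

Since AUC depends only on the ranking of the scores, it is unaffected by the class imbalance of the test set.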

3.1 Data

MIMIC-III johnson2016mimic is an openly accessible critical care database, comprising 46,520 patients with 58,976 hospital stays. It contains measurements of patient state (through vital signs, lab tests and other variables) as well as procedures and treatments. Crucially, it also contains over 2 million unstructured textual notes written by healthcare providers.

Following the data filtering and pre-processing steps in ghassemi2014unfolding , we restrict to adults (at least 18 years old) with only one hospital admission. Most importantly, we exclude notes from the discharge summary category and any notes recorded after the patient was discharged. This results in 31,244 patients with 812,158 notes. 13.82% of patients died in the hospital, 3.70% were discharged and died within thirty days, and 12.06% were discharged and died within a year. We randomly sample 10% of the patients for the test set and 10% for the validation set. The remaining 80% of the patients are used for training. We construct the vocabulary by keeping the 300K most frequent words across all notes and replace all words that are not part of the vocabulary with an out-of-vocabulary token.
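The patient-level split and the vocabulary construction can be sketched as follows. The helper names, the fixed random seed and the literal token string `UNK` are assumptions for illustration; the text only specifies the 80/10/10 proportions, the 300K vocabulary cap and the use of an out-of-vocabulary token.

```python
import random
from collections import Counter


def split_patients(patient_ids, seed=0):
    """Shuffle patients and return (train, validation, test) with an 80/10/10 split."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_held_out = len(ids) // 10
    test = ids[:n_held_out]
    validation = ids[n_held_out:2 * n_held_out]
    train = ids[2 * n_held_out:]
    return train, validation, test


def build_vocab(tokenized_notes, max_size=300_000):
    """Keep the `max_size` most frequent tokens across all notes."""
    counts = Counter(tok for note in tokenized_notes for tok in note)
    return {word for word, _ in counts.most_common(max_size)}


def replace_oov(tokens, vocab, oov="UNK"):
    """Map every token outside the vocabulary to the out-of-vocabulary token."""
    return [t if t in vocab else oov for t in tokens]


train, validation, test = split_patients(range(100))
vocab = build_vocab([["a", "b", "a"], ["b", "c"]], max_size=2)
cleaned = replace_oov(["a", "c"], vocab)
```

Splitting by patient rather than by note keeps all notes of one patient in the same partition, which avoids leaking information between training and evaluation.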

3.2 Baselines

LDA based model

We recreate the LDA-based Retrospective Topic Model from ghassemi2014unfolding . This model is the state-of-the-art method for mortality prediction on unstructured data from MIMIC II. We recreate the model on MIMIC III and closely follow their preprocessing and hyperparameter settings. We tokenize each note and remove all stopwords using the Onix stopword list. The vocabulary is constructed as the union of the 500 most informative words in each patient’s note based on a tf-idf metric. All words which are not part of the vocabulary are removed. We set the number of topics to 50 and set the LDA priors for the topic distributions and the topic-word distributions to $\alpha$ and $\beta$, respectively. We train a separate linear kernel SVM on the per-note topic distributions to predict the mortality for each task.
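The vocabulary construction for this baseline, a union of each patient's highest-scoring tf-idf terms, might look like the sketch below. The exact tf-idf variant is not specified in the text, so the raw term count multiplied by the log inverse document frequency is an assumption, as are the function name and the alphabetical tie-breaking.

```python
import math
from collections import Counter


def tfidf_vocabulary(patient_docs, top_k=500):
    """Union over patients of each patient's `top_k` highest tf-idf tokens."""
    n_docs = len(patient_docs)
    # Document frequency: in how many patients' texts does each token occur?
    doc_freq = Counter(tok for doc in patient_docs for tok in set(doc))
    vocab = set()
    for doc in patient_docs:
        term_freq = Counter(doc)
        scores = {t: c * math.log(n_docs / doc_freq[t]) for t, c in term_freq.items()}
        # Sort by descending score; break ties alphabetically for reproducibility.
        top = sorted(scores, key=lambda t: (-scores[t], t))[:top_k]
        vocab.update(top)
    return vocab


# Toy usage with top_k=1: "the" occurs for every patient and therefore scores zero.
toy_vocab = tfidf_vocabulary([["fever", "fever", "the"], ["cough", "the"]], top_k=1)
```

Taking the union across patients lets rare but patient-specific terms enter the vocabulary even if they are infrequent in the corpus as a whole.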

Since SVM classifiers are sensitive to significant class-imbalances, we follow ghassemi2014unfolding in randomly sub-sampling the patients who did not die in the training sets to reach a ratio of 70%/30% between the negative and positive class. We do not modify the distribution of classes within the test and validation set. The LDA vectors are trained on the entire training data, but the SVM classifiers are trained using the vectors from the down-sampled training sets only.
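The down-sampling of surviving patients to a 70%/30% ratio can be sketched as follows. The function name, the fixed seed and the representation of examples as `(features, label)` pairs are assumptions; only the target ratio and the restriction to the training set come from the text.

```python
import random


def downsample_negatives(examples, neg_share=0.7, seed=0):
    """Keep all positives and sample negatives so they make up about `neg_share` of the result.

    examples: list of (features, label) pairs, where label 1 means the patient died.
    """
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    # Number of negatives needed for a neg_share : (1 - neg_share) class ratio.
    target_neg = int(round(len(positives) * neg_share / (1.0 - neg_share)))
    rng = random.Random(seed)
    kept = rng.sample(negatives, min(target_neg, len(negatives)))
    sampled = positives + kept
    rng.shuffle(sampled)
    return sampled


# Toy training set: 30 deaths and 200 survivors with placeholder features.
toy_train = [(i, 1) for i in range(30)] + [(i, 0) for i in range(30, 230)]
balanced = downsample_negatives(toy_train)
```

This routine would be applied to the training partition only, so that validation and test scores still reflect the original class distribution.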

Feed-forward Neural Network

As our second baseline we use the popular distributed bag of words (DBOW) scheme proposed by Le and Mikolov le2014distributed . In a range of initial experiments, we selected the DBOW architecture (rather than the distributed memory alternative) and the embedding space dimensionality that performed best in terms of accuracy and generality. Using the same pre-processing as for the LDA baseline, we train separate linear SVMs for each task.

3.3 Parameters and Pretraining

We pre-train 50-dimensional word vectors on the training data using the word2vec implementation of the gensim toolbox gensim . Our word-level CNN uses 50 filters of sizes 3, 4 and 5 to produce the sentence representation. We embed the note categories in a separate vector space and use 50 filters of size 3 for the sentence-level CNN to produce the patient representation. Furthermore, we regularize the fully connected layer before our final softmax by L2 regularization on the weights and dropout with keep probability 0.8.

3.4 Results

Table 1 summarizes the results of the three models on all tasks. Across all methods there seems to be a general tendency that labels further in the future are harder to predict. We observe that both neural models are superior to the LDA baseline, in particular on the two harder tasks. Furthermore, our two-level CNN model outperforms doc2vec by a significant margin on all tasks.

Task       LDA     doc2vec   CNN
Hospital   0.930   0.930     0.963
30-day     0.800   0.831     0.858
1-year     0.790   0.824     0.853
Table 1: MIMIC-III Mortality Prediction AUC

To highlight the effectiveness of target replication, Table 2 shows the results of our model with and without target replication. We report results for the 30-day post-discharge task, but performance on the other tasks is comparable.

Model   Without target replication   With target replication
AUC     0.682                        0.858
Table 2: Performance analysis for target replication

The results of our CNN show that modeling sentence and document structure explicitly results in noticeable performance gains. In addition, learning sentence representations and training them on the classification task through our regularizer enables us to retrieve a patient’s most informative sentences. This allows an inspection of the model’s features, similar to LDA’s topic distributions but on the sentence level, and stands in stark contrast to doc2vec’s generic document representations. To showcase these features, Table 3 shows a patient’s top-scoring sentences indicating high likelihoods of survival and death, respectively.

P(survival) high:
  the remaining support lines are unchanged .
  no effusion .
  the cardiomediastinal contours are normal .
P(survival) low:
  now found to have metastatic lesions in her brain .
  impression UNK multiple large enhancing masses within the brain with surrounding vasogenic edema most consistent with .
  enhancing lesions in the right temporal lobe and right mid brain consistent with metastatic disease .
Table 3: The three highest and three lowest scoring sentences of one patient in the 1-year task.
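Retrieving such sentences amounts to ranking a patient's sentences by their sentence-level mortality probability. The sketch below assumes these probabilities are already available from the sentence-level softmax; the function name is hypothetical and the probabilities in the usage example are invented placeholders, not model outputs.

```python
def most_informative(sentences, death_probs, k=3):
    """Return the k sentences with the lowest and the k with the highest death probability."""
    order = sorted(range(len(sentences)), key=lambda i: death_probs[i])
    survival = [sentences[i] for i in order[:k]]          # lowest P(death) first
    death = [sentences[i] for i in reversed(order[-k:])]  # highest P(death) first
    return survival, death


sentences = [
    "no effusion .",
    "the cardiomediastinal contours are normal .",
    "now found to have metastatic lesions in her brain .",
    "the remaining support lines are unchanged .",
]
placeholder_probs = [0.05, 0.08, 0.93, 0.10]  # illustrative values only
survival, death = most_informative(sentences, placeholder_probs, k=1)
```

A ranking of this kind treats every sentence as informative, which is one way to see why neutral sentences can end up near either end of the list.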

While most patients’ top-scoring sentences look promising, a careful study of the predictions reveals that some neutral sentences can be ranked too highly in either direction. This is due to the model’s inability to appropriately handle sentences that do not help to distinguish the two classes. We plan to address this in the future by a more advanced attention mechanism.

4 Conclusion

In this paper we developed a two-layer convolutional neural network for the problem of ICU mortality prediction. On the MIMIC-III critical care database our model outperforms both existing BOW approaches and the popular doc2vec neural document embedding technique on all three tasks. We conclude that accounting for word and phrase compositionality is crucial for identifying important text patterns. Such findings have impact beyond the immediate context of automatic prediction tasks and suggest promising directions for clinical machine learning research to reduce patient mortality.