With the advancement of medical technology, patients admitted to the intensive care unit (ICU) are monitored by various bedside instruments that measure vital signals about the patient's health. During the stay, doctors visit the patient intermittently for check-ups and write clinical notes about the patient's health and physiological progress. These notes can be viewed as summarized expert knowledge about the patient's state. All of these data, including instrument readings, procedures, lab events, and clinical notes, are recorded for reference. The availability of ICU data and the enormous progress in machine learning have opened up new possibilities for health care research. Monitoring patients in the ICU is a challenging and high-cost task, so predicting the condition of patients during their ICU stay can help allocate resources to the patients who need them most in a cost-effective way. Prior works (harutyunyan2017multitask; ghassemi2015multivariate; suresh2018learning; song2018attend; caballero2015dynamically) have focused exclusively on modeling the problem using the time-series signals from medical instruments; the expert knowledge in doctors' notes has been ignored in the literature.
In this work, we use clinical notes in addition to the time-series data for improved prediction on the benchmark ICU management tasks of harutyunyan2017multitask. While the time-series data are measured continuously, the doctor notes are charted at intermittent times. This creates a new challenge of jointly modeling continuous time series and discrete-time note events. We propose such a multi-modal deep neural network, comprising recurrent units for the time series and a convolutional network for the clinical notes. We demonstrate that adding clinical notes improves performance on in-hospital mortality prediction, decompensation modeling, and length-of-stay forecasting.
2 Related Work
We provide a review of machine learning approaches for clinical prediction tasks.
Biomedical natural language processing.
rios2015convolutional and baker2016cancer used convolutional neural networks to classify various biomedical articles. Pre-trained word and sentence embeddings have also shown good results for sentence similarity tasks (chen2018biosentvec). Recently, there has been growing interest in the community in using clinical notes for ICU-related tasks (jin2018improving; boag2018s; liu2019knowledge; huang2019clinicalbert). Given the long, structured nature of clinical text, we prefer convolutional neural networks over recurrent networks, as in previous studies (zhang2016rationale; boag2018s). The work closest to ours is jin2018improving, who use aggregated word embeddings of clinical notes for in-hospital mortality prediction.
ICU management related literature
ICU management literature has focused exclusively on using time-series measurements for the prediction tasks (harutyunyan2017multitask; ghassemi2015multivariate; suresh2018learning; song2018attend; caballero2015dynamically). Recurrent neural networks have been the models of choice in these recent works, with additional gains from attention and multi-task learning (song2018attend). xu2018raim incorporated supplemental information such as diagnoses, medications, and lab events to improve model performance. We use RNNs for modeling the time series in this work, with a setup identical to harutyunyan2017multitask.
Multi-modal learning has shown success in speech, natural language, and computer vision (ngiam2011multimodal; mao2014explain). Recently, much work has combined images or videos with natural language text (elliott2016multimodal). We follow a similar intuition in combining clinical notes with time-series data for ICU management tasks. In the next section, we define the three benchmark tasks we evaluate in this work.
3 Prediction Tasks
We adopt the benchmark task definitions of harutyunyan2017multitask for the following three problems:
In-hospital Mortality: This is a binary classification problem: predict, from the first two days (48 hours) of ICU data, whether a patient dies before being discharged.
Decompensation: The focus is to detect patients who are physiologically declining. Decompensation is defined as a sequential prediction task in which the model makes a prediction at each hour after ICU admission. The target at each hour is whether the patient dies within the next 24-hour window.
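To make the label definition concrete, the hourly decompensation targets can be generated as in the following sketch (a simplified illustration of the 24-hour-window rule; the benchmark's exact alignment and masking details may differ):

```python
def decompensation_labels(stay_hours, death_hour=None, window=24):
    """Hourly binary labels: 1 if the patient dies within the next
    `window` hours of that time step, else 0 (survivors get all zeros)."""
    labels = []
    for t in range(1, stay_hours + 1):
        if death_hour is not None and t <= death_hour <= t + window:
            labels.append(1)
        else:
            labels.append(0)
    return labels
```

For example, a patient who dies at hour 30 of a 30-hour stay receives positive labels from hour 6 onward, since death then falls within the following 24 hours.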
Length of Stay Forecasting (LOS): The benchmark defines LOS as a multi-class classification problem over the bucketed remaining ICU stay. The remaining stay is discretized into 10 buckets: one for stays shorter than a day (24 hours), seven one-day buckets covering the first week, one for stays of one to two weeks, and one for stays longer than two weeks. This task is only defined for patients who did not die in the ICU.
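The bucketing can be written as a small lookup (the bucket edges below are an assumption following the harutyunyan2017multitask benchmark description: one sub-day bucket, seven one-day buckets, and two long-stay buckets):

```python
# Bucket edges in days (assumed per the benchmark: <1 day, seven one-day
# buckets, 1-2 weeks, and 2+ weeks -> 10 classes in total).
LOS_EDGES_DAYS = [1, 2, 3, 4, 5, 6, 7, 8, 14]

def los_bucket(remaining_hours):
    """Map remaining ICU stay (in hours) to one of 10 ordinal classes."""
    days = remaining_hours / 24.0
    for i, edge in enumerate(LOS_EDGES_DAYS):
        if days < edge:
            return i
    return len(LOS_EDGES_DAYS)  # 14+ days -> class 9
```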
These tasks have been identified as key performance indicators of models that can be beneficial in ICU management in the literature. Most of the recent work has focused on using RNN to model the temporal dependency of the instrument time series signals for these tasks (harutyunyan2017multitask, song2018attend).
4 Methods
In this section, we describe the models used in this study. We first introduce the notation, then describe the baseline architecture, and finally present our proposed multimodal network.
For a patient’s ICU stay of length $T$ hours, we have time-series observations $x_1, \ldots, x_T$, one at each time step (1-hour interval), measured by instruments, along with doctor’s notes recorded at irregular time stamps. Formally, for each patient’s ICU stay, we have time-series data $x_{1:T}$ of length $T$, and $M$ doctor notes $n_1, \ldots, n_M$ charted at times $t_1, \ldots, t_M$, where $M$ is generally much smaller than $T$. For in-hospital mortality prediction, $y^{m}$ is a binary label at $t = 48$ hours, which indicates whether the person dies in the ICU before being discharged. For decompensation prediction, performed hourly, $y^{d}_{t}$ are binary labels at each time step $t$, indicating whether the person dies in the ICU within the next 24 hours. For LOS forecasting, also performed hourly, $y^{l}_{t}$ are multi-class labels defined by buckets of the remaining length of stay of the patient in the ICU. Finally, we denote by $D$ the concatenated doctor’s notes during the ICU stay of the patient (i.e., from $t_1$ to $t_M$).
4.1 Baseline: Time-Series LSTM Model
Our baseline model is similar to the models of harutyunyan2017multitask. For all three tasks, we use a Long Short-Term Memory (LSTM) network hochreiter1997long to model the temporal dependencies between the time-series observations $x_1, \ldots, x_T$. At each step, the LSTM composes the current input with its previous hidden state to generate its current hidden state; that is, $h_t = \mathrm{LSTM}(x_t, h_{t-1})$ for $t = 1$ to $T$. The predictions for the three tasks are then performed with the corresponding hidden states as follows:

$$p^{m} = \sigma(W_m h_{48}), \qquad p^{d}_{t} = \sigma(W_d h_t), \qquad p^{l}_{t} = \mathrm{softmax}(W_l h_t)$$

where $p^{m}$, $p^{d}_{t}$, and $p^{l}_{t}$ are the probabilities for in-hospital mortality, decompensation, and LOS, respectively, and $W_m$, $W_d$, and $W_l$ are the respective weights of the fully-connected (FC) layer. Note that in-hospital mortality is predicted at the end of 48 hours, while the predictions for the decompensation and LOS tasks are made at each time step after the first four hours of the ICU stay. We train the models using the cross-entropy (CE) loss, $\mathcal{L}_{\mathrm{CE}} = -\big(y \log p + (1 - y) \log(1 - p)\big)$ for the binary tasks and $\mathcal{L}_{\mathrm{CE}} = -\sum_{c} y_c \log p^{l}_{c}$ for the multi-class LOS task.
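A minimal numpy sketch of the prediction heads' nonlinearities and the CE losses (illustrative only; it is not the TensorFlow implementation we trained):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / e.sum()

def binary_ce(y, p, eps=1e-12):
    """CE loss for the binary mortality / decompensation targets."""
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def multiclass_ce(y_onehot, p, eps=1e-12):
    """CE loss for the 10-class LOS buckets (y_onehot is a one-hot vector)."""
    return -np.sum(np.asarray(y_onehot) * np.log(np.asarray(p) + eps))
```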
4.2 Multi-Modal Neural Network
In our multimodal model, the goal is to improve the predictions by taking both the time-series data and the doctor notes as input to the network.
Convolutional Feature Extractor for Doctor Notes.
As shown in Fig. 2, we adopt a convolutional approach similar to kim-2014-convolutional to extract textual features from the doctor’s notes. For a clinical note $n_i$, our CNN takes the word embeddings $w_1, \ldots, w_{|n_i|}$ as input and applies 1D convolution operations, followed by max-pooling over time, to generate a $k$-dimensional feature vector, which is fed to the fully connected layer alongside the LSTM output from the time-series signal (described in the next paragraph) for further processing. From now on, we denote the 1D convolution over a note $n$ as $\mathrm{CNN}(n)$.
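The convolve-and-max-pool extractor can be sketched as follows (random, untrained filters for illustration; the kernel sizes and filter counts here are placeholders, not the tuned values reported later):

```python
import numpy as np

def conv_maxpool(emb, kernels=(2, 3, 4), n_filters=4, seed=0):
    """1D convolution over word embeddings with max-pooling over time.
    emb: (n_words, d) matrix of word embeddings for one note."""
    rng = np.random.default_rng(seed)
    n, d = emb.shape
    feats = []
    for k in kernels:
        W = rng.standard_normal((n_filters, k * d)) * 0.1  # untrained filters
        # Slide a width-k window over the note and convolve each window.
        windows = np.stack([emb[i:i + k].ravel() for i in range(n - k + 1)])
        conv = np.maximum(windows @ W.T, 0.0)   # ReLU, shape (n-k+1, n_filters)
        feats.append(conv.max(axis=0))          # max over time per filter
    return np.concatenate(feats)                # len(kernels) * n_filters vector
```

Each kernel size yields one pooled vector; their concatenation is the note's fixed-length feature regardless of note length.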
Model for In-Hospital Mortality.
This model takes the time-series signals $x_{1:48}$ and all the notes to predict the mortality label at $t = 48$. The time series is processed through an LSTM layer just like the baseline model in Sec. 4.1, and for the notes, we concatenate ($\oplus$) all the notes charted between $t = 0$ and $t = 48$ to generate a single document $D$. More formally,

$$p^{m} = \sigma\big(W\,[h_{48};\ \mathrm{CNN}(D)]\big), \qquad \text{where } D = n_1 \oplus n_2 \oplus \cdots \oplus n_M.$$
We use pre-trained word2vec embeddings mikolov2013distributed, trained on both MIMIC-III clinical notes and PubMed articles, to initialize our models, as they outperform other embeddings, as shown in chen2018biosentvec. We also freeze the embedding-layer parameters, as we did not observe any improvement from fine-tuning them.
Model for Decompensation and Length of Stay.
Being sequential prediction problems, decompensation and length-of-stay modeling require a special technique to align the discrete text events with the continuous time-series signals, measured at one event per hour. Unlike in-hospital mortality, here we extract feature maps by processing each note independently using 1D convolution operations. For each time step $t$, let $z_t$ denote the extracted text feature vector used for prediction at time step $t$. We compute $z_t$ as follows:

$$z_t = \sum_{i=1}^{M_t} \exp\big(-\lambda\,(t - t_i)\big)\,\mathrm{CNN}(n_i)$$

where $M_t$ is the number of doctor notes seen before time step $t$, and $\lambda \geq 0$ is a decay hyperparameter tuned on validation data. Notice that $z_t$ is computed as a weighted sum of the note feature vectors, where the weights are given by an exponential decay function. The intuition behind using a decay is to give preference to recent notes, as they better describe the current state of the patient.
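The decayed aggregation of note features can be sketched directly (a minimal numpy version; feature vectors stand in for $\mathrm{CNN}(n_i)$ outputs):

```python
import numpy as np

def decayed_text_feature(note_feats, note_times, t, lam=0.01):
    """Weighted sum of note feature vectors with exponential time decay.
    note_feats: list of per-note feature vectors; note_times: charting
    hours t_i. Only notes with t_i <= t contribute to the sum."""
    z = np.zeros_like(np.asarray(note_feats[0], dtype=float))
    for f, ti in zip(note_feats, note_times):
        if ti <= t:
            z += np.exp(-lam * (t - ti)) * np.asarray(f, dtype=float)
    return z
```

With `lam=0.01`, a note charted 10 hours ago still carries weight `exp(-0.1) ≈ 0.905`, so the decay is gentle but strictly favors recent notes.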
The time-series data is modeled using an LSTM as before. We concatenate the attenuated text feature from the CNN with the LSTM output for the prediction tasks as follows:

$$p^{d}_{t} = \sigma\big(W_d\,[h_t;\ z_t]\big), \qquad p^{l}_{t} = \mathrm{softmax}\big(W_l\,[h_t;\ z_t]\big)$$
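The concatenate-and-predict step can be sketched in numpy as follows (weight shapes and names are illustrative placeholders, not trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multimodal_predict(h_t, z_t, W_d, W_l):
    """Concatenate the LSTM state h_t with the decayed text feature z_t,
    then predict decompensation (sigmoid) and the LOS class (softmax)."""
    x = np.concatenate([h_t, z_t])
    p_decomp = sigmoid(W_d @ x)          # scalar probability
    logits = W_l @ x
    e = np.exp(logits - logits.max())
    p_los = e / e.sum()                  # distribution over LOS buckets
    return p_decomp, p_los
```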
Both our baseline and multimodal networks are regularized using dropout and weight decay. We used the Adam optimizer to train all our models.
We used the MIMIC-III dataset johnson2016mimic for all our experiments, following harutyunyan2017multitask's benchmark setup for processing the time-series signals from the ICU instruments. We use the same test set defined in the benchmark and 15% of the remaining data as a validation set. For the in-hospital mortality task, only patients who were in the ICU for at least 48 hours are considered. However, we dropped all clinical notes without an associated chart time, as well as all patients without any notes. Owing to this step, our results are not directly comparable to the numbers reported by harutyunyan2017multitask. Notes charted before ICU admission are concatenated and treated as one note at $t = 0$. After pre-processing, the number of patients is 11,579 for in-hospital mortality and 22,353 for the other two tasks.
For the in-hospital mortality task, the best-performing baseline and multimodal networks have an LSTM cell with 256 hidden units. For the convolution operation, we used 256 filters for each of the kernel sizes 2, 3, and 4. For decompensation and LOS prediction, we used 64 hidden units for the LSTM and 128 filters for each of the kernel sizes 2, 3, and 4. The best decay factor $\lambda$ for the text features was 0.01. We implement our methods in TensorFlow tensorflow2015-whitepaper; the code can be found at https://github.com/kaggarwal/ClinicalNotesICU. All our models were regularized with a dropout rate of 0.2 and a weight-decay coefficient of 0.01. We ran each experiment 5 times with different initializations and report the mean and standard deviation.
Following the benchmark, we use the Area Under the Precision-Recall Curve (AUCPR) metric for the in-hospital mortality and decompensation tasks, as they suffer from class imbalance, with only about 10% of patients suffering mortality; davis2006relationship recommend AUCPR for imbalanced-class problems. For LOS, in accordance with harutyunyan2017multitask, we use Cohen's linear weighted kappa, which measures the correlation between the predicted and actual multi-class buckets.
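For reference, linear weighted kappa can be computed from the confusion matrix as follows (a minimal numpy sketch, not the evaluation code of the benchmark):

```python
import numpy as np

def linear_weighted_kappa(y_true, y_pred, n_classes=10):
    """Cohen's kappa with linear disagreement weights |i - j|,
    suited to the ordinal LOS buckets."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    O = np.zeros((n_classes, n_classes))           # observed confusion matrix
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    W = np.abs(np.subtract.outer(np.arange(n_classes), np.arange(n_classes)))
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()  # chance agreement
    return 1.0 - (W * O).sum() / (W * E).sum()
```

Perfect agreement yields kappa 1.0, and predictions that are consistently one bucket off are penalized less than predictions many buckets away.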
We compared the multimodal network with the baseline time-series LSTM models on all three tasks. The results of our experiments are reported in Table 1. Our proposed multimodal network outperforms the time-series models on all three tasks. For in-hospital mortality prediction, we see an improvement of around 7.8% over the baseline time-series LSTM model. The other two problems are inherently more challenging than the first task, and modeling the notes for sequential prediction is difficult; with our multimodal network, we saw improvements of around 6% and 3.5% for decompensation and LOS, respectively.
Despite dropping patients without notes or chart times, we did not observe a change in our baseline's performance relative to the results reported in the harutyunyan2017multitask benchmark study. To understand the predictive power of clinical notes alone, we also trained text-only models using the CNN part of our proposed model. Additionally, we tried averaged word embeddings without a CNN as another way to extract features from the text. The text-only models perform poorly compared to the time-series baseline; hence, text provides additional predictive power only on top of the time-series data.
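The averaged-word-embedding (Avg WE) variant simply replaces the CNN extractor with a mean over the note's word vectors, as in this sketch:

```python
import numpy as np

def avg_word_embedding(note_word_vecs):
    """Avg WE baseline: the note feature is the mean of its word vectors,
    used in place of the CNN feature extractor."""
    return np.mean(np.asarray(note_word_vecs, dtype=float), axis=0)
```

Unlike the CNN, this discards word order and local n-gram patterns, which is consistent with its weaker results in Table 1.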
Table 1: Results on the three benchmark tasks.

In-Hospital Mortality
| Model | AUROC | AUCPR |
|---|---|---|
| Baseline (No Text) | 0.844 | 0.487 |
| MultiModal - Avg WE | 0.851 | 0.492 |
| MultiModal - 1DCNN | 0.865 | 0.525 |

Decompensation
| Model | AUROC | AUCPR |
|---|---|---|
| Baseline (No Text) | 0.892 | 0.325 |
| MultiModal - Avg WE | 0.902 | 0.311 |
| MultiModal - 1DCNN | 0.907 | 0.345 |

Length of Stay
| Model | Kappa |
|---|---|
| Baseline (No Text) | 0.438 |
| MultiModal - Avg WE | 0.449 |
| MultiModal - 1DCNN | 0.453 |
Identifying a patient’s condition in advance is of critical importance for acute care and ICU management. The literature has exclusively focused on using time-series measurements from ICU instruments to this end. In this work, we demonstrate that utilizing clinical notes along with the time-series data can significantly improve prediction performance. In the future, we expect further improvements from more advanced models for the clinical notes, since the text summarizes expert knowledge about the patient’s condition.