Dynamically Extracting Outcome-Specific Problem Lists from Clinical Notes with Guided Multi-Headed Attention

Problem lists are intended to provide clinicians with a relevant summary of patient medical issues and are embedded in many electronic health record systems. Despite their importance, problem lists are often cluttered with resolved or currently irrelevant conditions. In this work, we develop a novel end-to-end framework that first extracts diagnosis and procedure information from clinical notes and subsequently uses the extracted medical problems to predict patient outcomes. This framework is both more performant and more interpretable than existing models used within the domain, achieving an AU-ROC of 0.710 for bounceback readmission and 0.869 for in-hospital mortality occurring after ICU discharge. We identify risk factors for both readmission and mortality outcomes and demonstrate that our framework can be used to develop dynamic problem lists that present clinical problems along with their quantitative importance. We conduct a qualitative user study with medical experts and demonstrate that they view the lists produced by our framework favorably and find them to be a more effective clinical decision support tool than a strong baseline.




1 Introduction

Problem lists are an important component of the electronic health record (EHR) that are intended to present a clear and comprehensive overview of a patient’s medical problems. These lists document illnesses, injuries, and other details that may be relevant for providing patient care and are intended to allow clinicians to quickly gain an understanding of the pertinent details necessary to make informed medical decisions and provide patients with personalized care (AHIMA; h11). Despite their potential utility, there are shortcomings with problem lists in practice. One such shortcoming is that problem lists have been shown to suffer from a great deal of clutter (Holmes2012). Irrelevant or resolved conditions accumulate over time, leading to a problem list that is overwhelming and difficult for a clinician to quickly understand. This directly impairs the ability of a problem list to serve its original purpose of providing a clear and concise overview of a patient’s medical condition.

A challenge that comes with attempting to reduce clutter is that many conditions on the list may be relevant in certain situations, but contribute to clutter in others. For example, if a patient ends up in the intensive care unit (ICU), a care unit for patients with serious medical conditions, then the attending physician likely does not care about the patient’s history of joint pain. That information, however, would be important for a primary care physician to follow up on during future visits. In this case, the inclusion of chronic joint pain clutters the list for the attending physician in the ICU, but removing it from the list could decrease the quality of care that the patient receives from his/her primary care physician.

In this work, we address this problem by developing a novel end-to-end framework to extract problems from the textual narrative and then utilize the extracted problems to predict the likelihood of an outcome of interest. Although our framework is generalizable to any clinical outcome of interest, we focus on ICU readmission and patient mortality in this work to demonstrate its utility. We extract dynamic problem lists by utilizing problem extraction as an intermediate learning objective to develop an interpretable patient representation that is then used to predict the likelihood of the target outcome. By identifying the extracted problems important for the final prediction, we can produce a problem list tailored to a specific outcome of interest.

We demonstrate that this framework is both more interpretable and more performant than the current state-of-the-art work using clinical notes for the prediction of clinical outcomes (analysis_attn_clinical; khadanga2019using; nyu). Utilizing the intermediate problem list for the final outcome prediction allows clinicians to gain a clearer understanding of the model’s reasoning than prior work that only highlighted important sections of the narrative. This is because our framework directly identifies clinically meaningful problems while the prior work requires a great deal of inference and guesswork on the part of the clinician to interpret what clinical signal is being represented by the highlighted text.

For example, prior work predicting the onset of heart disease found that the word “daughter” was predictive of that outcome. The authors stated that the word usually arose in the context of the patient being brought in by their daughter which likely signaled poor health and advanced age (nyu). While this makes sense after reviewing a large number of notes, this connection is not immediately obvious and a clinician would not have the time to conduct the necessary investigation to identify such a connection. By instead directly extracting predefined clinical conditions and procedures and using those for the final prediction, we reduce the need for such inference on the part of the physician.

The primary contributions of this work are:

  • A novel end-to-end framework for the extraction of clinical problems and the prediction of clinical outcomes that is both more interpretable and performant than models used in prior work.

  • An expert evaluation that demonstrates that our problem extraction model exhibits robustness to labeling errors contained in a real-world clinical dataset.

  • Dynamic problem lists that report the quantitative importance of each extracted problem to an outcome of interest, providing clinicians with a concise overview of a patient’s medical state and a clear understanding of the factors responsible for the model’s prediction.

  • A qualitative expert user study that demonstrates that our dynamic problem lists offer statistically significant improvements over a strong baseline as a clinical decision support tool.

Generalizable Insights about Machine Learning in the Context of Healthcare

A significant body of past work develops predictive models that cannot be used in clinically useful settings due to their reliance on billing codes assigned after a patient leaves the hospital (harlan_claims; He2014; Ghassemi2014; arash; sun1; Barbieri2020). While there may be value in the technical innovations made by such work, research that acknowledges and addresses the constraints of the domain is essential to develop methods that can actually be implemented in practice. We demonstrate that recent methods for automated ICD code assignment are sufficiently performant to extract billing information in real-time for downstream modeling tasks. Although we focus on extracting problem lists for clinical decision support in this work, this finding has broader ramifications for the field. It both enables the real-time implementation of previously impracticable work and paves the way for future work to develop clinically feasible models that utilize dynamically extracted diagnosis and procedure information from clinical text.

2 Related Work

There has been a large body of prior work utilizing natural language processing (NLP) techniques to extract information from clinical narratives.

sontag demonstrated that unstructured clinical notes could be used to effectively identify patients with heart failure in real time. Their methods that involved data from clinical notes outperformed those using only structured data, demonstrating the importance of effectively utilizing the rich source of information contained within the clinical narrative.

Prior work has found success predicting ICD code assignment using clinical notes within MIMIC-III and has found that deep learning techniques outperform traditional methods (elhadad; caml; soa; multi_icd). caml augmented a convolutional model with a per-label attention mechanism and found that it led to both improved performance and greater interpretability as measured by a qualitative, expert evaluation. soa later improved upon their model by utilizing multiple convolutions of different widths and then max-pooling across the channels before the attention mechanism.

There has also been work demonstrating that machine learning models can effectively leverage the unstructured clinical narrative for the prediction of clinical outcomes (Ghassemi2014; nyu; analysis_attn_clinical). analysis_attn_clinical augmented long short-term memory networks (LSTMs) with an attention mechanism and applied them to predict clinical outcomes such as mortality and ICU readmission. However, when defining readmission, they treated both ICU readmissions and deaths as positive examples. The clinical work by harlan has demonstrated that these are orthogonal outcomes, and thus modeling them jointly as a single outcome does not make sense from a clinical perspective. By treating them as separate outcomes in this work, we are able to independently explore the risk factors for these two distinct outcomes.

analysis_attn_clinical also raised some questions about the interpretability of attention in their work with clinical notes, repeating the experiments introduced by attn_not to evaluate the explanatory capabilities of attention. However, attn_not_not explored some of the problems with their underlying assumptions and experimental setup and demonstrated that their experiment failed to fully explore their premise, and thus failed to support their claim.

Figure 1: Outcomes explored in this work

3 Data and cohort

This work is conducted using the free text notes stored in the publicly available MIMIC-III database (mimic). The database contains de-identified clinical data for over forty thousand patients who stayed in the critical care units of the Beth Israel Deaconess Medical Center. This information was collected as part of routine clinical care and, as such, is representative of the information that would be available to clinicians in real-time. This makes the dataset well-suited for developing clinical models.

To develop our cohort, we first filter out minors because children have different root causes for adverse medical outcomes than the general populace. We also remove patients who died while in the ICU and filter out ICU stays that are missing information regarding the time of admission or discharge. We then extract all ICU stays where the patient had at least three notes on record before the time of ICU discharge to develop a cohort with a meaningful textual history. This leaves us with unique patients and ICU stays.

For ICU readmission we extract labels for two types of readmissions: bounceback and 30-day readmission. Bounceback readmissions occur when a patient is discharged from the ICU and then readmitted to the ICU before being discharged from the hospital. For 30-day readmissions, we simply look for any readmission to the ICU within the 30 days following ICU discharge. For mortality, we also look at two different outcomes, in-hospital mortality and 30-day mortality. Because we use all data available at the time of ICU discharge, in-hospital mortality is constrained to mortality that occurs after ICU discharge but prior to hospital discharge. All the outcomes that we explored in this work are laid out in Figure 1. This provides us with a cohort with bounceback readmissions, 30-day readmissions, deaths within 30 days, and in-hospital deaths. For our experiments, we then split our cohort into training, validation, and testing splits following an 80/10/10 split and use 5-fold cross validation. We divide our cohort based on the patient rather than the ICU stay to avoid data leakage when one patient has multiple ICU stays.
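The readmission definitions above can be sketched as a label-derivation function. The field names (`icu_in`, `icu_out`, `hosp_out`) are illustrative assumptions for this sketch, not actual MIMIC-III column names:

```python
from datetime import datetime, timedelta

def readmission_labels(icu_stays):
    """Derive (bounceback, 30-day readmission) labels for each ICU stay.

    `icu_stays`: a single patient's stays sorted by ICU admission time,
    each a dict with hypothetical keys `icu_in` (ICU admission),
    `icu_out` (ICU discharge), and `hosp_out` (hospital discharge).
    """
    labels = []
    for i, stay in enumerate(icu_stays):
        nxt = icu_stays[i + 1] if i + 1 < len(icu_stays) else None
        # Bounceback: readmitted to the ICU before hospital discharge.
        bounceback = nxt is not None and nxt["icu_in"] < stay["hosp_out"]
        # 30-day: any ICU readmission within 30 days of ICU discharge.
        readmit_30d = (
            nxt is not None
            and nxt["icu_in"] <= stay["icu_out"] + timedelta(days=30)
        )
        labels.append((bounceback, readmit_30d))
    return labels
```

Note that a bounceback readmission is, by construction, also a 30-day readmission whenever it occurs within the 30-day window, which matches the nested outcome definitions in Figure 1.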

We extract all clinical notes associated with a patient’s hospital stay up until the time of their discharge from the ICU. The text is then preprocessed by lowercasing the text, normalizing punctuation, and replacing numerical characters and de-identified information with generic tokens. All of the notes for each patient are then concatenated and treated as a continuous sequence of text which is used as the input to all of our models. We truncate or pad all clinical narratives to 8000 tokens. This captures the entire clinical narrative for over of patients and we found that extending the maximum sequence length beyond that point did not lead to any further improvements in performance.
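A minimal sketch of this preprocessing pipeline follows. The placeholder token strings are assumptions for illustration; the `[** ... **]` pattern is the de-identification format used in MIMIC-III notes:

```python
import re

# Hypothetical placeholder tokens; the actual tokens used in the paper
# are not specified here.
NUM_TOKEN = "<num>"
DEID_TOKEN = "<deid>"
PAD_TOKEN = "<pad>"
MAX_LEN = 8000

def preprocess(notes):
    """Concatenate a patient's notes into one fixed-length token sequence."""
    text = " ".join(notes).lower()
    # Replace MIMIC-III de-identified spans, which look like [** ... **].
    text = re.sub(r"\[\*\*[^\]]*\*\*\]", DEID_TOKEN, text)
    # Replace runs of digits with a generic numeric token.
    text = re.sub(r"\d+", NUM_TOKEN, text)
    # Normalize punctuation by spacing it out, then tokenize on whitespace.
    text = re.sub(r"([.,!?;:()])", r" \1 ", text)
    tokens = text.split()
    # Truncate or pad to a fixed length of 8000 tokens.
    tokens = tokens[:MAX_LEN]
    tokens += [PAD_TOKEN] * (MAX_LEN - len(tokens))
    return tokens
```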

4 Methods

In this work, we develop an end-to-end framework to jointly extract problems from the clinical narrative and then use those problems to predict a target outcome of interest. An overview of our framework can be seen in Figure 2. We embed the clinical notes using learned word embeddings and then apply a convolutional attention model with a guided multi-headed attention mechanism to extract problems from the narrative. We then utilize the intermediate problem predictions to predict the target outcome. This differs from standard deep learning models because the features used for our final prediction are clearly mapped to clinically meaningful problems rather than opaque learned features. We also describe the training procedure that we develop to ensure that our problem extraction model maintains a high level of performance, something that is essential for the intermediate features to maintain their clinical significance.

Figure 2: Overview of our proposed framework

4.1 Embedding techniques

We utilize all notes in the MIMIC-III database associated with subjects who are not in our testing set to train embeddings using the Word2Vec method (w2v). This allows for training on a greater selection of notes than if training had been limited to the training set. This training is done using the continuous bag-of-words implementation and it generates embeddings for all words that appear in at least 5 notes in our corpus. We replace out-of-vocabulary words with a randomly initialized UNK token to represent unknown words. We explored both 100- and 300-dimensional word embeddings, and early testing showed that 100-dimensional word embeddings led to better performance.
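The vocabulary restriction and UNK replacement can be sketched as follows. The Word2Vec training itself is omitted, and the token name is illustrative; only the document-frequency threshold and the out-of-vocabulary mapping from the text above are shown:

```python
from collections import Counter

UNK = "<unk>"  # illustrative name for the shared unknown-word token

def build_vocab(tokenized_notes, min_note_freq=5):
    """Keep words appearing in at least `min_note_freq` distinct notes."""
    doc_freq = Counter()
    for note in tokenized_notes:
        doc_freq.update(set(note))  # count each word once per note
    return {w for w, c in doc_freq.items() if c >= min_note_freq}

def lookup(token, vocab):
    """Map out-of-vocabulary words to the shared UNK token."""
    return token if token in vocab else UNK
```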

4.2 Target Problems

We experiment with multiple different representations for the intermediate problems in this work. The first representation we explore is the set of ICD9 codes assigned to all hospital stays in our dataset. These codes are used for billing purposes and represent diagnostic and procedure information for each patient. Although prior work has found that these codes are predictive of adverse outcomes (Ghassemi2014; arash; Barbieri2020), these codes are assigned after a patient has been discharged from the hospital and, as such, directly using these codes as features in a predictive model limits the clinical utility of such a model. By instead learning to dynamically assign these codes within our framework, we can use these codes to predict the outcomes we explore using only the information available at the time of prediction.

However, the large ICD9 label space will likely hinder our framework's ability to effectively extract and utilize the codes. To address this, we leverage the hierarchical nature of the ICD9 taxonomy. Full ICD9 codes are represented by character strings up to characters in length where each subsequent character represents a finer-grained distinction. To address the problem of the large label space, we experiment with rolled up ICD9 codes, which consist of only the first three characters of each ICD9 code. The rolled up codes still represent clinically meaningful procedures and conditions while substantially reducing the number of labels.
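The rollup operation itself is a simple truncation. This sketch assumes codes may arrive in either dotted ('414.01') or undotted ('41401') form, and does not handle the longer category prefixes of supplementary E-codes:

```python
def roll_up_icd9(code):
    """Roll a full ICD9 code up to its three-character category.

    Characters beyond the category prefix encode finer-grained
    distinctions, so truncation yields a coarser but still clinically
    meaningful label. Supplementary E-codes (four-character categories)
    would need special handling and are not covered here.
    """
    head = code.split(".")[0]  # drop any sub-classification after the dot
    return head[:3]
```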

We also explore using phecodes, which were developed to conduct phenome-wide association studies (PheWAS) in EHRs (phewas). Prior work demonstrated that phecodes better represent clinically meaningful phenotypes than ICD9 codes (Wei_2017). Because of this, phecodes may lead to more clinically meaningful and predictive intermediate representations than ICD9 diagnosis codes. A mapping from ICD9 codes to phecodes already exists and can be used to extract phecodes from our dataset. Similar to ICD9 codes, we explore both full and rolled up phecodes. For every problem representation in this work, we only use codes that occur at least times in our training set to reduce label sparsity. After this filtering, there are an average of full ICD diagnosis codes, full ICD procedure codes, full phecodes, rolled ICD diagnosis codes, rolled ICD procedure codes, and rolled phecodes across our folds.

4.3 Problem extraction model

Figure 3: Illustration of our problem extraction model with a single attention mechanism shown.

The convolutional attention architecture used in this work is similar to that developed by caml and soa for automatic ICD code assignment. The model can be described as follows. We represent the clinical narrative as a sequence of $d_e$-dimensional dense word embeddings. Those word embeddings are then concatenated to create the matrix $X = [x_1, x_2, \ldots, x_N]$, where $N$ is the length of the clinical narrative and $x_i$ is the word embedding for the $i$th word in the narrative. We then apply a convolutional neural network (CNN) to the matrix $X$. In this work, we use three convolutional filters of widths 1, 2, and 3 with output dimensionality $d_c$. These filters convolve over the textual input with a stride of 1, applying the learned filters to every 1-gram, 2-gram, and 3-gram in the input. In this work, we augment the CNN with a multi-headed attention mechanism where each head is associated with a problem (attn_all_you_need). Unlike the work of caml and soa, we apply our attention mechanisms over multiple convolutional filters of different lengths. This allows our model to consider variable spans of text while still maintaining the straightforward interpretability of the model introduced by caml.

To apply the attention mechanisms, we learn a query vector, $u_\ell \in \mathbb{R}^{d_c}$, for each problem $\ell$ that will be used to calculate the importance of the feature maps across all filters for that problem. We calculate the importance using the dot product of each feature map with the query vector. We let $H \in \mathbb{R}^{n \times d_c}$ be the concatenated output of our CNN and can then calculate the attention distribution over all of the feature maps simultaneously using the matrix-vector product of our final feature map and the query vector as $\alpha_\ell = \mathrm{softmax}(H u_\ell / \sqrt{d_c})$, where $\sqrt{d_c}$ is used as a scaling factor and $\alpha_\ell$ contains the score for every position across all the filters. The softmax operation is used so that the score distribution is normalized. We calculate the final representation used for classification for problem $\ell$ by taking a weighted average of all of the outputs based on their calculated weights, given by $v_\ell = \sum_{i=1}^{n} \alpha_{\ell,i} h_i$, where $h_i$ is the $i$th feature vector in $H$ and $v_\ell$ is the final representation used for predicting the presence of problem $\ell$.

Given the representation $v_\ell$, we calculate the final prediction as $\hat{y}_\ell = \sigma(w_\ell^\top v_\ell + b_\ell)$, where $w_\ell$ is a vector of learned weights, $b_\ell$ is the bias term, and $\sigma$ is the sigmoid function. We train our problem extraction model by minimizing the binary cross-entropy loss function given by $L_{\mathrm{prob}} = -\sum_\ell \left[ y_\ell \log \hat{y}_\ell + (1 - y_\ell) \log(1 - \hat{y}_\ell) \right]$, where $y_\ell$ is the ground truth label and $\hat{y}_\ell$ is our model's prediction for problem $\ell$.
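The per-label attention, sigmoid prediction, and cross-entropy loss described above can be sketched in NumPy. Shapes and variable names here are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def problem_scores(H, queries, W, b):
    """Per-label attention over CNN feature maps.

    H:       (n, d) concatenated CNN outputs, one row per position.
    queries: (L, d) one query vector per problem.
    W:       (L, d) per-problem classification weights.
    b:       (L,)   per-problem biases.
    Returns problem probabilities (L,) and attention weights (L, n).
    """
    n, d = H.shape
    probs, attn = [], []
    for u, w, b_l in zip(queries, W, b):
        alpha = softmax(H @ u / np.sqrt(d))  # attention over positions
        v = alpha @ H                        # weighted average of features
        s = w @ v + b_l                      # scalar score for this problem
        probs.append(1.0 / (1.0 + np.exp(-s)))
        attn.append(alpha)
    return np.array(probs), np.array(attn)

def bce(y, p):
    """Binary cross-entropy summed over all problems."""
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```

In practice the per-problem loop would be batched as a single matrix product, but the loop form mirrors the per-problem equations directly.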

4.4 Outcome classification

In our proposed framework, the feature vector used for the outcome prediction is $s = [s_1, s_2, \ldots, s_L]$, where $L$ is the number of problems and $s_\ell = w_\ell^\top v_\ell + b_\ell$ is the scalar score for problem $\ell$ defined above. We calculate our final prediction using this vector similarly to our intermediate problem prediction as $\hat{y} = \sigma(w_o^\top s + b_o)$. Using the score for each problem as the features for the final prediction allows for the straightforward interpretation of each feature. This differs from the standard deep learning models used in prior works where the final feature vector used for the prediction is composed of learned features that are not interpretable. We utilize this improvement to explain our model’s decision making process and to develop dynamic problem lists.

To optimize the classification objective for our target outcome, we also minimize the binary cross-entropy loss function $L_{\mathrm{out}} = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$, where $y$ is the ground truth label for our target outcome and $\hat{y}$ is our model’s prediction for that outcome.
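The outcome head reduces to a weighted sum of the interpretable problem scores. This sketch (with illustrative names) includes a helper showing how per-problem contributions to the outcome logit could be ranked to build a dynamic problem list:

```python
import numpy as np

def outcome_prediction(scores, w_out, b_out):
    """Outcome probability from the per-problem scores.

    `scores` holds the pre-sigmoid score for each problem; because each
    feature maps to a named clinical problem, each weight in `w_out`
    directly weights an interpretable feature.
    """
    logit = w_out @ scores + b_out
    return 1.0 / (1.0 + np.exp(-logit))

def problem_importance(scores, w_out):
    """Per-problem contribution to the outcome logit, usable to rank
    problems in a dynamic problem list for a given outcome."""
    return w_out * scores
```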

4.5 Training procedure

For our intermediate features to be interpretable, it is important for our problem extraction model to maintain a high level of performance. This motivates the development of our training procedure. We define a threshold for the performance of our problem extraction model and train only that component of our framework if the validation performance falls below that threshold. This ensures that we are only training the final classification layer using intermediate representations that effectively represent their corresponding problem. This also prevents our target classification objective from degrading the performance of our problem extraction model as that would harm the interpretability and clinical utility of our framework.

Thus our final loss function can be defined as $L = L_{\mathrm{prob}} + \mathbb{1}[p \geq \tau] \cdot L_{\mathrm{out}}$, where $p$ is the validation performance of the problem extraction model and $\tau$ is a pre-defined performance threshold. We measure the performance of our problem extraction model by calculating the micro-averaged Area Under the Receiver Operating Curve (AU-ROC) on the validation set and use a threshold of for the models trained in this work. We found this training procedure to be necessary to maintain good problem extraction performance for problem configurations that involved full codes, while the configurations with rolled codes were able to maintain performance during joint training. We optimize our final loss function using the Adam optimizer (adam). Our code is made publicly available at https://github.com/justinlovelace/Dynamic-Problem-Lists and we relegate full implementation details to the appendix.
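The gating logic of this training procedure can be sketched as follows; the threshold value of 0.9 is an assumption for illustration, not the value used in the paper:

```python
def joint_loss(prob_loss, outcome_loss, val_auroc, threshold=0.9):
    """Gate the outcome objective on problem extraction quality.

    If the validation micro AU-ROC of the extraction model falls below
    the threshold, train only the extraction component; otherwise train
    both objectives jointly. The default threshold is illustrative.
    """
    if val_auroc < threshold:
        return prob_loss
    return prob_loss + outcome_loss
```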

5 Experiments and results

5.1 Baselines

To evaluate the efficacy of our proposed framework at predicting our target outcomes, we develop three strong baselines based on recent work for clinical outcome prediction using clinical text (khadanga2019using; nyu; analysis_attn_clinical). The first baseline is the convolutional model developed by CNN for text classification. This model consists of three convolutions of width 1, 2, and 3 which are applied over the clinical narrative and then max-pooled. The three pooled representations are then concatenated and used for the final prediction.

The second baseline is similar to the model used for problem extraction in our proposed framework and is a straightforward extension of the model proposed by caml. Unlike our problem extraction model, this baseline utilizes a single attention head and directly predicts the outcome of interest. This baseline allows us to not only compare the predictive performance of our model, but to also explore the improved interpretability that our framework provides. For our third baseline, we use a bidirectional LSTM augmented with an additive attention mechanism which was used by analysis_attn_clinical in their work predicting clinical outcomes from notes.

5.2 Outcome Results

For each outcome in this work, we explore using both full and rolled ICD codes and phecodes as our intermediate problems. To gain insight into the effectiveness of each subset of codes, we also explore using only the rolled ICD diagnosis codes, rolled ICD procedure codes, and rolled phecodes. For every model, we report the mean and standard deviation across the five testing folds for the area under the Receiver Operating Curve (AU-ROC) and the area under the Precision-Recall Curve (AU-PR) to evaluate the effectiveness of our models. The results for all of the outcomes explored in this work can be found in Table 1.
As expected, we find that trying to use the entire set of ICD codes for our intermediate problem representation is relatively ineffective, being outperformed by at least one of our baselines across all outcomes. We also observe that this problem extends to trying to utilize the full set of phecodes. However, we find that our model is very effective when using rolled ICD codes or phecodes. When using rolled codes, we find that our proposed framework outperforms all baselines with multiple different problem configurations across all outcomes and performance metrics.

Somewhat surprisingly, we find that using the individual subsets of codes does not lead to any loss in performance and appears to marginally improve performance. It is possible that the additional information provided by combining diagnostic and procedure codes is offset by difficulties that come from increasing the label space. We find that our framework leads to not only improved clinical utility (which we demonstrate later in this work), but also improved predictive performance.

5.3 Problem Extraction Results

[Table 1: Outcome Prediction Results. AU-ROC and AU-PR for In-Hospital Mortality, 30-Day Mortality, Bounceback Readmission, and 30-Day Readmission; rows compare the CNN-Max, Conv-Attn, and LSTM-Attn baselines against DynPL with each problem configuration (full and rolled ICD codes and phecodes, and the individual rolled subsets). F=Full Codes, R=Rolled Codes. Bolded values indicate equivalent or superior performance compared to all baselines and the best performance is underlined.]

[Table 2: Problem Extraction Results. Micro- and macro-averaged AU-ROC for each problem configuration (full and rolled ICD codes and phecodes), comparing a model trained on problem extraction alone against the intermediate extraction model trained jointly with each target outcome (Bounceback Readmission, 30-Day Readmission, In-Hospital Mortality, 30-Day Mortality).]

For our model to be interpretable, it is important for the problem extraction model to be effective. To explore the performance of our problem extraction model and the effect that the additional learning objective has on that performance, we conduct an additional experiment where we train our problem extraction model independently and compare it with the performance of our intermediate problem extraction model in our framework across all outcomes. We report results for this experiment in Table 2.

We observe that our problem extraction method is performant across all of the target outcomes in this work. However, we find that our problem extraction model is consistently more effective when using rolled codes as opposed to full sets of codes. This is understandable as the larger label space and finer-grained distinctions between the codes lead to a more challenging classification problem. This reduced problem extraction performance when using the full set of codes is likely a contributing factor to the poorer target outcome performance observed when using full sets of codes.

We do observe that the addition of the target outcome objective does degrade performance when compared to a model trained exclusively on problem extraction. This degradation demonstrates the importance of our training procedure to ensure that the intermediate problem extraction remains effective.

5.4 Effect of End-to-End Training

We conduct an ablation experiment to evaluate the effect of end-to-end training on our framework’s performance by first training our framework only on problem extraction, freezing the problem extraction component, and then fine-tuning the final classification layer to predict the outcome of interest. We report results for this experiment in Table 3 and observe a consistent decrease in performance when training the two components separately. This decrease is particularly notable for both mortality outcomes. This is likely because the feature space defined by the problems fails to represent all pertinent information from the notes, and training the network end-to-end allows for some adaptation to the final outcome. For example, the frozen problem extraction model would not be incentivized to recognize the severity of problems while such information would be useful when predicting the target outcomes.

[Table 3: Effect of End-to-End Training. AU-ROC and AU-PR for each outcome, comparing DynPL trained end-to-end against a variant with a frozen problem extraction component (Frozen DynPL) for two rolled-code problem configurations, including R-Phe.]

5.5 Comparison Against Oracle

We conduct an additional experiment to explore the effectiveness of our problem extraction model. In this experiment we train a logistic regression oracle to predict the outcomes directly from the ground truth labels derived from ICD codes. It is important to note that because ICD codes are associated with entire hospital stays in our dataset, this experiment involves using future information compared to the clinically useful application setting of our other models. Not only are ICD codes themselves unavailable at the time of ICU discharge, but the codes could represent medical problems or procedures that arise or occur later in a patient’s hospital stay after the patient is discharged from the ICU.

Nevertheless, this experiment can provide some insight into the effectiveness of our problem extraction model and whether it is currently a performance bottleneck. We report results for this logistic regression oracle across two of our problem configurations in Table 4. We find that using the ground truth labels leads to notably improved performance compared to our framework for the readmission outcomes, but actually leads to worse performance for most of the mortality outcomes. While the improvement for readmission outcomes can likely be attributed in part to the use of future information, the improvement likely also results from the improved accuracy of the problem labels, suggesting that the efficacy of our problem extraction model is a limiting factor in our framework’s performance. However, our framework is not reliant on any particular architecture for problem extraction and this experiment demonstrates that as advances continue to be made on the task of automated ICD coding, our framework will become increasingly viable. The worse performance for mortality outcomes again suggests that the problem space doesn’t perfectly represent all of the relevant information contained within the notes and highlights the importance of our end-to-end training regime which allows for some adaptation to the outcome of interest.

[Table 4: Comparison Against Oracle. AU-ROC and AU-PR for each outcome, comparing DynPL against a logistic regression oracle trained on the ground truth problem labels for two problem configurations, including R-Phe.]

5.6 Label Integrity

Although our framework’s problem extraction performance provides a straightforward way to validate the effectiveness of our problem extraction model, it is not a perfect method due to the nature of our ground truth labels. A number of past works have demonstrated that ICD codes are an imperfect representation of ground truth phenotypes in actual clinical practice (Chang2016; Benesch660; 10.2307/3768402; doi:10.1161/01.STR.30.1.56; doi:10.2105/AJPH.82.2.243; 10.1093/aje/kwh314; doi:10.1161/STROKEAHA.114.006316; doi:10.1161/CIRCOUTCOMES.113.000743; doi:10.1161/STROKEAHA.113.003408). A common trend observed in work exploring the accuracy of ICD codes is that they have strong specificity but poorer sensitivity. In other words, a patient assigned a given code very likely has the corresponding condition, but there are likely more patients with that condition than only the patients who were assigned that ICD code. Given that our dataset contains information gathered during routine clinical care, the ICD codes we use as ground truth labels in this work likely suffer from the same problem.

Because of this complication, perfect problem extraction performance, as evaluated against ICD codes, is actually suboptimal: such a model would have learned to perfectly replicate the biases and mistakes of the ICD coding process instead of correctly identifying all of the clinical problems. We hypothesize that if our problem extraction model is effective, then some of the 'incorrect' predictions that count against our model in the evaluation above are actually correct. To evaluate this hypothesis, we conduct an expert evaluation over a limited set of predictions.

Because ICD codes tend to have problems with sensitivity, most of the errors in our ICD labels should be false negatives. To evaluate whether our problem extraction model is correctly recognizing some of the problems missed by the ICD codes, we extract the most confident false positives for one of the models trained in this work and manually evaluate whether the patient actually has the corresponding problem. It is important to note that when conducting this evaluation, we do not necessarily follow ICD coding standards. We instead identify whether the patient has the corresponding problem in order to explore the challenges of using ICD codes as phenotype labels, as is done in this work and in prior work (rodriguez2018phenotype). We report the results for this experiment in Table 5.

                     Count    Percentage
Correct Prediction
Correct Label
Table 5: Expert Evaluation of 50 False Positives
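The false-positive extraction step described above can be sketched as follows (illustrative names; the 0.5 decision threshold is an assumption, not a value from the paper):

```python
import numpy as np

def top_false_positives(probs, labels, k=50):
    """Return indices of the k most confident predictions that disagree
    with the ICD labels (predicted positive, labeled negative), ordered
    from most to least confident."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    fp_mask = (probs >= 0.5) & (labels == 0)   # assumed threshold
    fp_idx = np.where(fp_mask)[0]
    order = np.argsort(-probs[fp_idx])         # sort by descending confidence
    return fp_idx[order][:k]
```

The returned indices would then be mapped back to (patient, code) pairs and reviewed manually by the expert.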

We observe that our hypothesis was correct and that a large majority () of the false positives that we extracted from our model were actually correct predictions penalized due to label inaccuracies. This demonstrates that our model is already reasonably robust to these label inaccuracies and is successfully extracting problems despite noisy labels. We also observe that the actual false positives are often well grounded in the text. For example, radiologists prioritize sensitivity over specificity when reporting observations, and we found multiple false positives resulting from radiological findings that required clinical correlation. Although there is a large body of work in ICD code classification in MIMIC (caml; soa; elhadad; multi_icd), we are the first to conduct such an analysis demonstrating the ability of our model to overcome label inconsistencies.

6 Interpretability

While we demonstrated that our framework is performant, its primary strength is the simplicity of interpretation that it provides. A survey of clinicians (explainable) identified aspects of explainable modeling that improve clinicians' trust in machine learning models. Clinicians identified the ability to understand feature importance as a critical aspect of explainability, as it allows them to easily compare the model's decision with their own clinical judgement. Clinicians expected to see both global feature importance and patient-specific importance, so we explore both in this work.

6.1 Global Trends

A large body of prior work has explored the interpretability of attention, but that exploration is typically limited to individual predictions (caml; show_attend_and_tell; hu). While that is useful, it is also important to gain an understanding of population-level trends.

By designing our framework such that the final prediction is a linear combination of the extracted problem scores, we can simply extract the weights from the final layer of our model to understand which problems are important. We calculated the mean and standard deviation of each problem's weight over the five folds and present the strongest risk factors across all outcomes in Table 6. We observe a number of common risk factors between outcomes. The top four risk factors for both readmission tasks were fluid disorders; puncture of vessel; renal failure; and congestive heart failure, nonhypertensive. Urinary tract infections and pneumonia were both strong factors for mortality, as were the shared readmission risk factors of puncture of vessel and fluid disorders.
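Extracting and aggregating the final-layer weights across folds might look like the following sketch (hypothetical names; the actual model code is not reproduced here):

```python
import numpy as np

def global_risk_factors(fold_weights, problem_names, top_k=5):
    """Aggregate the final linear layer's weights across folds and rank
    problems by mean weight (strongest positive risk factors first).
    `fold_weights` is a list of per-fold weight vectors, one entry per
    problem; returns (name, mean, std) tuples."""
    W = np.stack([np.asarray(w, dtype=float) for w in fold_weights])
    mean, std = W.mean(axis=0), W.std(axis=0)
    order = np.argsort(-mean)[:top_k]
    return [(problem_names[i], float(mean[i]), float(std[i])) for i in order]
```

Reporting the standard deviation alongside the mean makes it easy to discard weights whose magnitude is insignificant relative to their cross-fold variance.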

We also explored whether there were factors associated with healthy outcomes but found that even the most negative weights had magnitudes that were insignificant given their variance. Our model thus appears to recognize a limited number of positive risk factors, while the majority of the intermediate problems have little effect on the outcome. This makes it well-suited for producing clutter-free problem lists for clinicians, which we explore in the next section.

30-Day Mortality (ranked by weight)
1. Disorders of fluid, electrolyte, and acid-base balance
2. Puncture of vessel
3. Pneumonia
4. Urinary tract infection
5. Congestive heart failure; nonhypertensive

In-Hospital Mortality (ranked by weight)
1. Disorders of fluid, electrolyte, and acid-base balance
2. Urinary tract infection
3. Puncture of vessel
4. Renal failure
5. Pneumonia

30-Day Readmission (ranked by weight)
1. Disorders of fluid, electrolyte, and acid-base balance
2. Renal failure
3. Puncture of vessel
4. Congestive heart failure; nonhypertensive
5. Other anemias

Bounceback Readmission (ranked by weight)
1. Disorders of fluid, electrolyte, and acid-base balance
2. Puncture of vessel
3. Congestive heart failure; nonhypertensive
4. Renal failure
5. Hypertension

Table 6: Risk Factors for Target Outcomes

High-Risk Bounceback Readmission Patient
Problem: Top Two Spans of Attended Text

Other operations of abdominal region (includes paracentesis):
  [to attempt paracentesis again today] [suitable for paracentesis was marked]
Chronic liver disease and cirrhosis:
  [to attempt paracentesis again today] [suitable for paracentesis was marked]
Injection or infusion of therapeutic or prophylactic substance:
  [started on tpn plan was] [remains on tpn at present]
Puncture of vessel:
  [, beir hugger applied d/t low temp.;] [reddend alovesta cream applied id : tmax]
Disorders of fluid, electrolyte, and acid-base balance:
  [will be performed lft's elevated being followed] [pt is jaundiced , excoriated perianal area]
Septicemia:
  [support , sepsis work-up p-will] [levofloxacin and flagyl po skin]
Ascites (non malignant):
  [to attempt paracentesis again today] [suitable for paracentesis was marked]
Transfusion of blood and blood components:
  [pt had egd this pm] [rec'd # units ffp with]
Prophylactic vaccination and inoculation against certain viral diseases:
  [support , sepsis work-up p-will] [history of hepatorenal failure and]
Chronic ulcer of skin:
  [, beir hugger applied d/t low temp.;] [reddend alovesta cream applied id : tmax]
Renal failure:
  [s/p now with renal failure reason for] [s/p now with renal failure reason for]
Peritonitis and retroperitoneal infections:
  [to attempt paracentesis again today] [suitable for paracentesis was marked]
Other anemias:
  [rec'd n units ffp with] [rec'd n unit ffp with]
Viral hepatitis:
  [status , lactulose prn as] [remains on lactulose prn to]

Low-Risk Bounceback Readmission Patient (Truncated)

Diagnostic procedures on small intestine:
  [presently another endoscopy is scheduled] [had an endoscopy which revealed]
Other anemias:
  [nnd unit prbc infusing presently] [n unit prbc with initial]
Diseases of esophagus:
  [presently another endoscopy is scheduled] [had an endoscopy which revealed]
Effects radiation NOS:
  [nnd unit prbc infusing presently] [n unit prbc with initial]

Table 7: Dynamic Problem Lists

Highly Attended Text

High-Risk Bounceback Readmission Patient:
[radiology to attempt paracentesis again today]
[iv bid old tap site from]
[planning to do tap this evening]
[further oozing needs c-diff spec pmicu nursing]
[was d/cd a paracentesis was attempted]
[of ice chips tpn infusing as]
[overnight mushroom cath draining loose brown-green stool]
[was started on tpn plan was]
[pt remains on tpn at present]
[status , lactulose prn as]
[remains on lactulose prn to]
[re-oriented rec'ing lactulose po has]
[pt given lactulose x n]
[on po lactulose perl ,]

Low-Risk Bounceback Readmission Patient:
[small amts ice chips awaiting nnd endoscopy]
[understanding of discharge instructions and new]
[daughters care discharge instructions reviewed with]
[fbleeding noted discharge instructions , pt]
[ice chips per team neuro : a&oxn]
[taking medication discharge planning complete with]
[scheduled for this am- ? nam pt]
[of chron's disease and lower]
[, denies sob rr nn-nn]
[, dry , intact without reddness or]
[up the clots pt transferred]
[chron's disease and lower gib , now]
[in the <loc> area plan : repeat]
[given iv erythromycin and iv]

Table 8: Baseline Attention Interpretation

6.2 Individual Predictions

We construct dynamic problem lists by extracting the 14 strongest problem predictions. We chose 14 problems because the patients in the training fold had an average of codes assigned to their hospital stay, so 14 problems should provide an adequate summary of the patient's state. We report these problems sorted by their extraction probability and also report the importance of each problem for the final outcome so that clinicians can easily identify what factors are driving the prediction. For the problem importance, we scale the problem weights to the range [-1, 1] by dividing by the problem weight with the greatest magnitude, which allows for easier interpretation, and we also provide the spans of text the model attended to when making each problem prediction. To provide a comparison using our baseline convolutional attention model, we extract the spans of text with the greatest attention weights.
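Assembling such a list from per-problem extraction probabilities and final-layer weights can be sketched as follows (illustrative names; dividing by the largest-magnitude weight yields scaled importances in [-1, 1]):

```python
import numpy as np

def dynamic_problem_list(probs, weights, problem_names, n_problems=14):
    """Build a dynamic problem list: the n_problems most confident
    extractions, each paired with its extraction probability and its
    outcome weight scaled by the largest-magnitude weight."""
    probs = np.asarray(probs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    scaled = weights / np.abs(weights).max()   # scale to [-1, 1]
    top = np.argsort(-probs)[:n_problems]      # sort by extraction probability
    return [(problem_names[i], float(probs[i]), float(scaled[i])) for i in top]
```

Sorting by extraction probability rather than weight keeps the list a summary of the patient's state, with the scaled weight conveying how much each problem drives the prediction.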

We provide an example of a dynamic problem list for a patient predicted to be at high risk of bounceback readmission in Table 7. Looking at the dynamic problem list, we can quickly identify the most important problems driving the risk prediction (puncture of vessel, fluid disorder, renal failure, skin ulcer, intravenous feeding, and liver disease) while understanding that the other problems are insignificant. Reporting the quantitative importance of each problem saves the clinician from having to manually filter through the long list of problems. Furthermore, the extraction probability provides a measure of uncertainty which, along with the attended text, allows clinicians to intelligently verify the model's performance. For example, renal failure is an important risk factor but has a relatively low extraction probability of . Upon inspecting the highlighted text, the clinician can clearly observe that the extraction was accurate and that the patient is suffering from that condition. It is also worth noting that in this example the problem extraction model was able to successfully recognize that the patient had bed ulcers and a platelet transfusion, neither of which is represented by the ICD labels in the dataset.

By comparison, we provide the baseline visualization from the convolutional attention model for the same patient in Table 8. Here, we can only observe much broader trends, and there is a large degree of redundancy (e.g., paracentesis and tap refer to the same procedure). We can observe that the patient has severe liver problems from the need for paracentesis and the use of the medication lactulose. We can also observe that the patient required intravenous feeding from the references to total parenteral nutrition (TPN). However, there is a significant amount of redundancy, and it is not clear how to meaningfully aggregate these observations to gain an understanding of which clinical problems the model is extracting and how important they are for the final outcome. Furthermore, the overview of the patient is much less comprehensive than that provided by the dynamic problem list: all of the information extracted by the baseline is concisely aggregated into three codes in the dynamic problem list (Chronic liver disease and cirrhosis; Other operations of abdominal region; and Injection or infusion of therapeutic or prophylactic substance), which also quantitatively reports the importance of those conditions.

We compare a dynamic problem list to our baseline for a low-risk bounceback patient in the same tables and find that the benefits are even more pronounced. Examining the baseline visualization, we observe that the model primarily focuses on references to discharge instructions, which do not convey any clinically meaningful information. Similarly, the other attended phrases do not seem to convey any important medical information. On the other hand, the dynamic problem list for the low-risk patient still effectively extracts clinical conditions (that the patient had an esophageal disease, was anemic, etc.) and then concludes that the extracted conditions do not warrant concern. This clearly demonstrates to a clinician that the model is still effectively extracting the patient's clinical condition, but judges that condition to be safe. This transparency is important for clinicians to be able to trust that the model is effective.

7 Qualitative Expert User Study

While we have argued for the improved utility of our framework compared against recent work within the domain, it is important to verify that claim by conducting a user study with medical experts. For example, it may be possible that while our framework is sound in theory, the problem extraction stage is sufficiently noisy to render the extracted problem lists useless in practice. To examine the utility of our framework, we recruited four medical experts and conducted a user study where our experts evaluated the utility of our dynamic problem list and the baseline interpretation method. Three of our experts are currently practicing physicians while one is an MD-PhD student with one year of medical school remaining. Two of the medical experts are co-authors who were involved in some parts of the development of this work while the other two had no involvement with our work beyond taking part in the user study.

We conducted our user study by randomly sampling ICU stays from the test set of one of our day readmission models. Because of the imbalanced nature of our dataset, we sample stays from the top of predicted risks and sample the other stays from the remaining ICU stays. This ensures that we evaluate our framework both for high-risk patients and for patients representative of the general patient population. We then provided each of our expert reviewers with the clinical notes associated with each patient and instructed them to briefly review the notes to gain an understanding of the patient's medical condition. We then presented them with our dynamic problem list and the baseline attention extraction along with the predicted readmission risk, and the reviewers evaluated both methods independently using the Likert scale shown in Table 9.
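The sampling scheme can be sketched as follows; note that `top_frac`, `n_high`, and `n_rest` are placeholder parameters, since the paper's exact proportions are not reproduced here:

```python
import numpy as np

def sample_study_stays(risk_scores, n_high, n_rest, top_frac=0.05, seed=0):
    """Sample ICU stays for a user study: some from the highest-risk tail
    of the predicted-risk distribution and the rest from the remaining
    population. All proportions here are illustrative placeholders."""
    rng = np.random.default_rng(seed)
    order = np.argsort(-np.asarray(risk_scores, dtype=float))
    cutoff = max(1, int(len(order) * top_frac))
    high = rng.choice(order[:cutoff], size=n_high, replace=False)
    rest = rng.choice(order[cutoff:], size=n_rest, replace=False)
    return np.concatenate([high, rest])
```

This over-samples high-risk stays relative to their base rate so that the study covers both regimes despite the outcome's class imbalance.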

We report the results for this study in Table 10 and compute statistical significance for two comparisons. We examine the relationship between the two interpretation methods using a two-tailed paired t-test and explore whether the dynamic problem list is rated meaningfully better than a neutral rating using a two-tailed one-sample t-test. The first comparison allows us to examine whether our method is an improvement over the baseline, while the second allows us to evaluate whether the medical experts judged our method favorably. We observe that every expert found our framework more effective than the baseline method, and the difference was statistically significant for all but one expert. Additionally, every expert rated the problem list meaningfully better than neutral by a statistically significant margin. By contrast, two of our experts rated the baseline worse than neutral, and none of the experts rated it better than neutral by a statistically significant margin. When averaging the scores for each patient across all experts, we find that our method received a rating of on average compared to for the baseline method, a meaningful improvement over both the baseline () and a neutral rating (). These improvements remain significant even when limiting the evaluation to the two external experts, which accounts for potential biases from the experts who were familiar with this work. While a much more stringent evaluation (such as a randomized controlled trial) would need to be conducted before implementing our method in practice, this preliminary qualitative evaluation is promising, and we leave more rigorous evaluations to future work.
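The two significance tests can be computed with scipy (illustrative sketch; `dynpl` and `baseline` would hold per-patient ratings for one expert or the across-expert averages):

```python
from scipy import stats

def compare_ratings(dynpl, baseline, neutral=3.0):
    """Two-tailed tests from the user study: a paired t-test between the
    two interpretation methods, and a one-sample t-test of the dynamic
    problem list's ratings against the neutral point of the Likert scale."""
    paired = stats.ttest_rel(dynpl, baseline)        # DynPL vs. Conv-Attn
    vs_neutral = stats.ttest_1samp(dynpl, neutral)   # DynPL vs. neutral (3)
    return paired.pvalue, vs_neutral.pvalue
```

Both scipy functions are two-tailed by default, matching the tests described above.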

The list effectively identifies and presents relevant medical factors for evaluating readmission risk for this patient.

1 = Strongly Disagree
2 = Disagree
3 = Neutral
4 = Agree
5 = Strongly Agree

Table 9: Likert Scale

                          Medical Expert
                          1    2    3    4    Average    Average of External Experts
Convolutional Attention
Dynamic Problem List
DynPL vs. Conv-Attn (p)
DynPL vs. Neutral (p)

Table 10: User Study

8 Limitations and Future Work

We did not make the problem extraction architecture a large focus of this work and instead used a model representative of the recent state-of-the-art. In the future, we intend to improve upon the problem extraction module in our framework. In particular, we intend to explore whether we can utilize pre-trained language models to improve our problem extraction and downstream performance given their recent success across a wide variety of tasks both outside of and within the clinical domain (devlin-etal-2019-bert; alsentzer-etal-2019-publicly). In this work, we augmented our problem extraction module with a linear layer for its simplicity of interpretation and found that it led to strong performance. However, incorporating our problem extraction module into a more sophisticated model could potentially lead to meaningful improvements in performance and we intend to pursue this in future work. We would also like to extend this framework to other outcomes of clinical interest such as sepsis or the onset of intubation to evaluate its ability to generalize beyond the outcomes examined in this work.

9 Conclusion

In this work we develop a framework to extract outcome-specific problem lists from the clinical narrative while jointly predicting the likelihood of that outcome. We demonstrate that our framework is both more performant and more transparent than competitive baselines. Although there is a large body of work that has utilized billing information for clinical modeling, we are the first to demonstrate that it can be dynamically extracted in clinically useful settings to develop performant models. We also conduct a novel analysis to demonstrate that our problem extraction model is robust to labeling errors found in real-world clinical data. By reducing the final decision to a linear model that uses interpretable intermediate problems, we easily extract risk factors associated with the outcomes studied in this work. We also utilize this improved transparency to produce dynamic problem lists which were viewed more favorably than a competitive baseline according to an expert user study.


Appendix A.

The output dimensionality of all of our convolutional filters is set to . We apply dropout with a probability of after the embedding layer and with a probability of after the convolutional layer and before every linear layer. For our LSTM model we use 128 hidden units and similarly apply dropout with a probability of after the embedding layer and with a probability of before the final prediction. All of our models were trained with an effective batch size of 32 (gradient accumulation was necessary for some of the larger models) using a learning rate of with the Adam optimizer, with early stopping based on performance on the validation set. We train each model for a maximum of 100 epochs and stop training early when the AU-ROC for our target outcome has not improved for 10 epochs with stable problem extraction performance. We then evaluate the model with the best validation AU-ROC on the test set. All of our hyperparameters were tuned based on validation performance.
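The early-stopping loop described above can be sketched as follows (illustrative; the actual training code, including the stable-problem-extraction check, is not reproduced here):

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=100, patience=10):
    """Early stopping on validation AU-ROC: stop once the target-outcome
    AU-ROC has not improved for `patience` consecutive epochs, and return
    the best model state seen so far. `train_epoch` runs one epoch and
    returns the model state; `validate` scores a state on validation data."""
    best_auroc, best_state, epochs_since_best = -1.0, None, 0
    for _ in range(max_epochs):
        state = train_epoch()
        auroc = validate(state)
        if auroc > best_auroc:
            best_auroc, best_state, epochs_since_best = auroc, state, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return best_state, best_auroc
```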