With the rapid deployment of electronic health records (EHRs) in the US, clinicians routinely enter patient data electronically, mostly in unstructured, free-text clinical notes. Some key details of the patient’s clinical assessment and medical history are stored almost exclusively in these notes, making them an important source of information for a number of downstream applications, such as clinical trial recruitment, billing, and predictive modeling. However, certain characteristics of clinical notes make automated parsing a challenge: clinicians employ many non-standard, ambiguous shorthand phrases (for example, “af” for “afebrile” or “atrial fibrillation”) and organize notes in unpredictable ways. These challenges make secondary use of free-text EHR data difficult, often requiring manual chart abstraction by trained staff to pull key information from notes. Traditional natural language processing (NLP) techniques, which often rely on hand-crafted rules (Taggart et al., 2018), grammatical assumptions, and feature engineering techniques such as parse trees or dictionaries, can be difficult to apply in this messy, irregular data regime.
In practice, machine learning models tend to make more use of structured fields such as medications and diagnoses that can be straightforwardly extracted from the EHR (Pencina et al., 2016). Clinical notes are often ignored outright (Lipton et al., 2015; Choi et al., 2016; Esteban et al., 2016; Nickerson et al., 2016; Pham et al., 2016; Choi et al., 2017; Che et al., 2018; Cheng et al., 2016; Nguyen et al., 2017), and models that do use notes frequently reduce them to an unordered set of words (Marafino et al., 2018; Jacobson and Dalianis, 2016; Rajkomar et al., 2018) or topics (Miotto et al., 2016; Suresh et al., 2017). This ignores many subtleties of language and context, which can have a large impact on the meaning of the text in a note. For example, consider the snippet “no family hx of diabetes; discharge diagnosis of cva.” A word-level approach might represent this text as follows: [“cva”, “diabetes”, “diagnosis”, “discharge”, “family”, “hx”, “no”, “of”, “of”]. In this representation, the context phrases “no family hx” and “discharge diagnosis” are no longer associated with “diabetes” or “cva”, respectively. Such context is necessary to accurately determine which of the two conditions applies to the patient described in the note.
Recent advances in deep learning have led to major improvements in a wide variety of NLP applications (Johnson et al., 2017; Devlin et al., 2018). Building on this work, we propose a model employing sequential, hierarchical, and pretraining (SHiP) techniques from deep NLP to improve EHR predictive models by automatically learning to extract relevant information from clinical notes. Specifically, our model employs a modified hierarchical attention network (Yang et al., 2016) to read clinical notes in sequence within the larger sequence of the patient’s medical history, preferentially attending to relevant portions of the text in each note. To enrich our model’s learned representation, we augment our training procedure with language model pretraining (Dai and Le, 2015): before optimizing the model for the prediction task, we train the note-level model on an auxiliary objective in which, for each word in the note, it learns to predict the next word. To our knowledge, the effectiveness of language model pretraining has not been previously demonstrated for hierarchical classification models.
Our model reads clinical notes without assuming any particular layout, medical vocabulary, writing style or language rules in the text. By maintaining the sequential order of the words and abbreviations in the text, the model’s predictions can be informed by context that cannot be captured using keywords alone. We evaluate our model on standard classification tasks for EHRs, including identifying discharge diagnoses and predicting mortality risk, and compare the performance of this model against existing state-of-the-art baselines for these tasks (Rajkomar et al., 2018). We also evaluate the sensitivity of the model’s outputs to different phrases in the text using deep learning attribution methods (Sundararajan et al., 2017).
We developed our models using patient data from the Medical Information Mart for Intensive Care (MIMIC-III) database (Johnson et al., 2016; Pollard and Johnson, 2016), a research dataset of medical records collected from critical care patients at the Beth Israel Deaconess Medical Center between 2001 and 2012. We represented patients’ medical histories as a time series according to the Fast Healthcare Interoperability Resources (FHIR) specification, as described in previous work (Rajkomar et al., 2018). The study cohort included all patients in MIMIC-III hospitalized for at least 24 hours.
Table 1: Characteristics of the study cohort. Values are n (%) unless otherwise noted.

| | Train & validation | Test |
| --- | --- | --- |
| Number of patients | 40,511 | 4,439 |
| Number of hospital admissions* | 51,081 | 5,598 |
| Female | 22,468 (44.0) | 2,548 (45.5) |
| Male | 28,613 (56.0) | 3,050 (54.5) |
| Age, median (IQR) | 62 (32) | 62 (33) |
| Hospital discharge service | | |
| General medicine | 21,350 (41.8) | 2,354 (42.1) |
| Cardiovascular | 10,965 (21.5) | 1,175 (21.0) |
| Obstetrics | 7,123 (13.9) | 803 (14.3) |
| Cardiopulmonary | 4,459 (8.7) | 519 (9.3) |
| Neurology | 4,282 (8.4) | 457 (8.2) |
| Cancer | 2,217 (4.3) | 223 (4.0) |
| Psychiatric | 28 (0.1) | 4 (0.1) |
| Other | 657 (1.3) | 63 (1.1) |
| Previous hospital admissions | | |
| None | 40,362 (79.0) | 4,415 (78.9) |
| One | 6,427 (12.6) | 721 (12.9) |
| Two to five | 3,681 (7.2) | 397 (7.1) |
| Six or more | 611 (1.2) | 65 (1.2) |
| Discharge disposition | | |
| Home | 28,991 (56.8) | 3,095 (55.3) |
| Skilled nursing facility | 6,878 (13.5) | 794 (14.2) |
| Rehab | 5,757 (11.3) | 653 (11.7) |
| Other healthcare facility | 3,830 (7.5) | 448 (8.0) |
| Expired | 4,420 (8.7) | 462 (8.3) |
| Other | 1,205 (2.4) | 146 (2.6) |
| Number of discharge ICD-9 codes, median (IQR)** | 9 (8) | 9 (8) |

** Includes only billable ICD-9 codes.
2.1 Data Extraction and Feature Choices
The set of features we extracted from the patient records comprised basic encounter information (admission type, status, and source), diagnosis and procedure codes, medication orders, quantitative observations (lab results and vital signs), and free-text clinical notes. For each continuous feature, values were standardized to Z-scores using the mean and standard deviation from the training set, with any outliers more than 10 standard deviations from the mean capped to a score of ±10.
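As a concrete sketch of this standardization step (the function name and the example values are illustrative, not taken from the original pipeline):

```python
import numpy as np

def standardize(values, train_mean, train_std, cap=10.0):
    """Convert raw feature values to Z-scores using training-set
    statistics, capping outliers beyond `cap` standard deviations."""
    z = (np.asarray(values, dtype=float) - train_mean) / train_std
    return np.clip(z, -cap, cap)

# A value 400+ standard deviations from the training mean is capped to 10.
z = standardize([98.6, 500.0], train_mean=98.6, train_std=1.0)
```

Note that the cap is applied symmetrically, so extreme lows map to -10.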
3.1 Classification tasks
For each hospitalization, we developed models for the following tasks using the full history up to the specified time in the current admission (including all past hospitalizations).
Inpatient mortality prediction
Whether the patient died during the current admission (defined as a discharge disposition of “expired”). Predicted 24 hours after the patient’s admission.
Primary discharge diagnosis
The patient’s primary diagnosis at discharge, categorized according to the Clinical Classifications Software (CCS) scheme (Elixhauser, 1996). Predicted at the moment of discharge.
All discharge diagnoses
The full set of ICD-9 (Slee, 1978) billing codes associated with the patient’s discharge diagnoses, including any of 6,448 possible labels. Predicted at the moment of discharge.
3.2 Model Architecture
All models in our experiments shared a core embedding scheme and top-level LSTM architecture described in previous work (Rajkomar et al., 2018), differing only in how they handled text from clinical notes. For each discrete feature type in the patient timeline other than notes (e.g. diagnosis codes), individual tokens were “embedded,” or represented as low-dimensional vectors that were randomly initialized and then trained jointly with the model. To reduce sequence length, observations were grouped into coarse-grained timesteps one hour in length, which we refer to as unordered “bags,” and embeddings or values for observations of the same feature within the same bag were averaged; additionally, all observations occurring prior to the most recent 1,000 timesteps (i.e. more than 1,000 hours before the time of prediction) were grouped into a single bag. The averaged embeddings for each discrete feature, as well as the standardized values for each continuous feature, were then concatenated into a single representation of each hourly timestep in the patient history.
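The hourly bagging and averaging step can be sketched as follows (the event-tuple layout and the toy embedding table are illustrative assumptions, not the paper’s data format):

```python
import numpy as np
from collections import defaultdict

def bag_embeddings(events, embeddings, bag_hours=1.0):
    """Group (time_in_hours, feature_type, token) events into hourly bags
    and average the token embeddings per feature type within each bag.
    `embeddings` maps (feature_type, token) -> vector."""
    bags = defaultdict(lambda: defaultdict(list))
    for t, feature, token in events:
        bag_id = int(t // bag_hours)
        bags[bag_id][feature].append(embeddings[(feature, token)])
    # Average the embeddings within each (bag, feature) group.
    return {
        bag_id: {f: np.mean(vecs, axis=0) for f, vecs in feats.items()}
        for bag_id, feats in bags.items()
    }

emb = {("dx", "250.0"): np.array([1.0, 0.0]),
       ("dx", "401.9"): np.array([0.0, 1.0])}
events = [(0.2, "dx", "250.0"), (0.7, "dx", "401.9"), (1.5, "dx", "250.0")]
bags = bag_embeddings(events, emb)
# Bag 0 averages the two diagnosis embeddings; bag 1 holds the third alone.
```

The averaged per-feature vectors in each bag would then be concatenated into one timestep representation.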
We fed the embedded sequence to a long short-term memory (LSTM) network, a type of recurrent neural network that computes activations $h_t$ at each timestep as a nonlinear function of the current input embedding $x_t$ and the previous hidden state $h_{t-1}$ and cell state $c_{t-1}$, according to the following equations (Hochreiter and Schmidhuber, 1997):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

Here $\sigma$ denotes the sigmoid function, $\tanh$ denotes the hyperbolic tangent function, and $\circ$ denotes the Hadamard (elementwise) product. The final hidden state of the LSTM was passed through a feedforward output layer to generate predictions, with a sigmoid activation for mortality or ICD-9 prediction or a softmax activation for primary CCS prediction. We trained the model to minimize the cross-entropy loss on the ground-truth labels.
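A minimal NumPy sketch of a single LSTM timestep may clarify the gate structure (the stacked parameter layout, with the four gates concatenated into one weight matrix, is an implementation choice of this sketch, not a detail from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM timestep (Hochreiter & Schmidhuber, 1997). W (4H x D),
    U (4H x H), and b (4H) stack the input, forget, output, and
    candidate parameters in that order."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])          # input gate
    f = sigmoid(z[H:2 * H])      # forget gate
    o = sigmoid(z[2 * H:3 * H])  # output gate
    g = np.tanh(z[3 * H:4 * H])  # candidate cell state
    c = f * c_prev + i * g       # Hadamard (elementwise) update
    h = o * np.tanh(c)
    return h, c
```

Iterating `lstm_step` over the embedded timesteps yields the final hidden state fed to the output layer.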
We compared the following variants of this model in our experiments:
No notes
The free-text notes were not included, and the model was trained exclusively on the other elements of the record.
Bag-of-words
Notes were included in the record, but treated just as any other discrete feature. We tokenized text at the word level, converted to lowercase, and stripped all punctuation. Individual words were then embedded, and all word embeddings within each hourly bag (which may contain several notes) were simply averaged together, ignoring word ordering and nearby context. We included a variant of this model that also uses bigrams, or pairs of adjacent words: the bigram strings were hashed to a fixed number of buckets (a multiple of the unigram vocabulary size), and then embedded, bagged, and concatenated with the bagged unigrams.
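The bigram hashing trick can be illustrated as below (the hash function and bucket count are our own choices for the sketch; the paper does not specify its hashing scheme):

```python
import hashlib

def bigram_buckets(tokens, num_buckets):
    """Map each adjacent word pair to a bucket id via a stable hash,
    so bigrams share a fixed-size embedding table."""
    ids = []
    for w1, w2 in zip(tokens, tokens[1:]):
        digest = hashlib.md5(f"{w1}_{w2}".encode()).hexdigest()
        ids.append(int(digest, 16) % num_buckets)
    return ids

tokens = "no family hx of diabetes".split()
ids = bigram_buckets(tokens, num_buckets=1000)  # 4 bigrams -> 4 bucket ids
```

Each bucket id indexes an embedding, which is then bagged and averaged exactly like the unigram embeddings.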
Hierarchical
For each note, we embedded the individual words as in the bag-of-words model, but maintained the sequential order of embeddings within the note. We fed the embedded notes to a second LSTM, which reads the terms sequentially to generate a context-sensitive vector representation at each word. For the notes, we experimented with both unidirectional and bidirectional LSTMs (see Appendix): the latter processes the sequence in both the forward and reverse directions and concatenates the hidden states at each timestep, so that each output from the LSTM incorporates both previous and future context. We computed the final output vector for each note by aggregating the hidden states for each word according to a learned attention weighting, which places higher weight on the portions of the notes that are most important for the downstream prediction. Specifically, we used a slightly modified version of the hierarchical attention network (Yang et al., 2016): the model computes the dot product of a query vector $q$ with each hidden state $h_t$ and normalizes via a softmax function to obtain the attention weighting $\alpha$ over the sequence, augmented with an additional prior embedding vector $p$ and corresponding scalar bias weight $b$ (where $q$, $p$, and $b$ are learned jointly with the model during training):

$$
\alpha = \operatorname{softmax}\!\left(\left[\, q^\top h_1,\ \dots,\ q^\top h_T,\ q^\top p + b \,\right]\right), \qquad
v = \sum_{t=1}^{T} \alpha_t h_t + \alpha_{T+1}\, p
$$
As before, outputs for individual notes within the same hourly bag were averaged together, and the bagged note vectors were concatenated with the remaining feature vectors for input into the record-level LSTM.
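One plausible reading of this prior-augmented attention, sketched in NumPy (the exact placement of the prior slot in the softmax is our assumption for the sketch, not a confirmed detail of the model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_with_prior(H, q, p, b):
    """Attention pooling over note hidden states H (T x d): each score is
    the dot product of query q with a hidden state, plus one extra
    softmax slot for prior embedding p with scalar bias b, letting the
    model divert attention away from the text entirely."""
    scores = np.concatenate([H @ q, [p @ q + b]])
    alpha = softmax(scores)               # length T + 1, sums to 1
    states = np.vstack([H, p[None, :]])   # hidden states plus the prior
    return alpha, alpha @ states          # weights and pooled note vector
```

With a large bias `b`, nearly all attention mass shifts to the prior slot and the pooled vector approaches `p`.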
SHiP (Sequential, Hierarchical and Pretrained)
For this variant of the hierarchical LSTM, we augmented the standard training procedure with unsupervised language model pretraining (Dai and Le, 2015) for the note-level LSTM: before optimizing the cross-entropy loss, we trained an auxiliary objective such that, for each word in the note, the forward LSTM learned to predict the next word (and if bidirectional, the backward LSTM learned to predict the previous word). Note that while the mortality models were restricted to the first 24 hours of data from the current hospitalization (as well as any data from previous admissions) during training and evaluation, we pretrained these models over the full set of notes from the hospitalization up to discharge. We also found that the hierarchical mortality models performed best when, after pretraining, the notes LSTM weights were then frozen during the standard training phase.
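The forward language-model objective amounts to a shifted cross-entropy loss over each note; a minimal sketch (for a bidirectional model, the mirror-image loss would be applied to the backward LSTM’s logits):

```python
import numpy as np

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def next_word_nll(logits, token_ids):
    """Mean negative log-likelihood for next-word prediction: logits at
    position t (shape T x V) score the token at position t+1."""
    logp = log_softmax(logits[:-1])
    targets = np.asarray(token_ids)[1:]
    return float(-logp[np.arange(len(targets)), targets].mean())
```

Minimizing this loss before the supervised phase is what distinguishes the SHiP variant from the plain hierarchical model.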
For memory and performance reasons, in all hierarchical models we restricted the maximum amount of text used in the notes LSTM, keeping only the most recent tokens per record (across all notes) and discarding any additional leading tokens. We tuned the level of truncation on the validation set: a smaller token limit was sufficient for training mortality models, while we increased the limit for both diagnosis tasks and for pretraining. For comparison, we also trained variants of all models with notes as the only feature available to the model, excluding the other elements of the record.
3.3 Attribution Methods
To compute attribution scores over the text of notes, we use the path-integrated gradients technique (Sundararajan et al., 2017). For clarity in these attributions, we ran a notes-only model over only the selected note, omitting the rest of the notes in the patient’s record. For a note with word embedding $x_i$ for each word $i$, we define the gradient attribution $g_i$ as the gradient of the model output $F$ for the class of interest (e.g. the patient’s true diagnosis) with respect to the word embedding:

$$
g_i(x) = \frac{\partial F(x)}{\partial x_i}
$$

The integrated gradients attribution, then, is the integral of these gradient attributions along the straight-line path between a baseline word embedding $x'_i$ (which we choose to be the zero vector) and the learned word embeddings. If we approximate the integral using $m$ steps, this can be computed as:

$$
\mathrm{IG}_i(x) = (x_i - x'_i) \circ \frac{1}{m} \sum_{k=1}^{m} g_i\!\left(x' + \frac{k}{m}\,(x - x')\right)
$$

In practice, we use a fixed number of steps $m$. This score offers a first-order approximation of the effect of each input on the relevant model output. Integrated gradients are straightforward to implement and satisfy certain properties not guaranteed by standard gradient attribution (Sundararajan et al., 2017). The use of gradient-based attribution in text models is still an active area of research.
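The Riemann-sum approximation of integrated gradients can be implemented directly (the `grad_fn` interface, zero baseline, and step count here are illustrative choices for the sketch):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline=None, steps=50):
    """Approximate path-integrated gradients (Sundararajan et al., 2017)
    along the straight line from `baseline` to `x`. `grad_fn(z)` returns
    the gradient of the output of interest with respect to input z."""
    if baseline is None:
        baseline = np.zeros_like(x)  # zero-vector baseline, as in the text
    total = np.zeros_like(x)
    for k in range(1, steps + 1):
        total = total + grad_fn(baseline + (k / steps) * (x - baseline))
    return (x - baseline) * total / steps
```

For a linear model the approximation is exact, and the attributions sum to the difference in output between the input and the baseline (the completeness property).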
4.1 Training and Evaluation Approach
We split our patient cohort according to patient ID into 80% train, 10% validation, and 10% test splits. For each task, we tuned learning rate, LSTM size, and dropout hyperparameters for the record-level LSTM using the full-feature BOW models. We transferred these hyperparameters to the SHiP models, and then performed additional tuning of the learning rate and of the corresponding size and dropout parameters for the note-level LSTM. Models were optimized using Adam (Kingma and Ba, 2015); the dropout techniques applied included standard input and hidden-layer dropout (Srivastava et al., 2014), variational input and hidden-layer dropout and vocabulary dropout (Gal and Ghahramani, 2016), and zoneout (Krueger et al., 2016). For the SHiP models, all dropout techniques were applied during both pretraining and training. We used a Gaussian process bandit optimization algorithm (Desautels et al., 2014) to search for and select hyperparameters maximizing performance for each task on the validation set. Metrics for model selection included AUROC (for mortality and ICD-9) and top-5 recall (for CCS). For predicting ICD-9 codes, a multilabel task, we computed a weighted AUROC, where the AUROC for each label is averaged according to the label’s prevalence.
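The prevalence-weighted AUROC can be sketched as a support-weighted average of per-label AUROCs (this rank-based AUROC ignores score ties, an acceptable simplification for illustration):

```python
import numpy as np

def auroc(y_true, scores):
    """Binary AUROC via the rank-sum (Mann-Whitney U) formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def weighted_auroc(Y, S):
    """Multilabel AUROC averaged with each label weighted by its
    prevalence (number of positive examples for that label)."""
    aucs = np.array([auroc(Y[:, j], S[:, j]) for j in range(Y.shape[1])])
    weights = Y.sum(axis=0)
    return float((aucs * weights).sum() / weights.sum())
```

For multilabel indicator arrays, `sklearn.metrics.roc_auc_score` with `average="weighted"` computes the same support-weighted average.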
Following hyperparameter tuning, we trained each model five times from different random initializations for each task. We used early stopping to select the best model for each run according to validation set performance, and we report the mean and standard deviation of the test set performance over all five runs. We also compute and report the statistical significance of the change in performance between the best baseline and best hierarchical model for each task, and between the best hierarchical models with and without pretraining, using a two-tailed Welch’s t-test. All models were implemented in TensorFlow 1.12 (Abadi et al., 2016) and trained on Nvidia Tesla P100 GPUs. Evaluation metrics and statistical tests were calculated using scikit-learn 0.20 (Pedregosa et al., 2011).
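The Welch’s t-test used for these comparisons can be sketched as a direct computation of the t statistic and degrees of freedom (computing the p-value from the t distribution is omitted to keep the sketch dependency-free):

```python
import numpy as np
from math import sqrt

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples of
    per-run test metrics with unequal variances."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    va, vb = a.var(ddof=1) / len(a), b.var(ddof=1) / len(b)
    t = (a.mean() - b.mean()) / sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and also returns the two-tailed p-value.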
4.2 Model Performance
Table 2: Test-set performance of each model variant, reported as mean (standard deviation) over five runs.

| Model | Variant | Mortality AUROC | Mortality AUPRC | Primary CCS Top-5 Recall | Primary CCS F1 | All ICD-9 Weighted AUROC | All ICD-9 AUPRC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No notes | – | 0.869 (0.001) | 0.449 (0.006) | 0.793 (0.004) | 0.521 (0.008) | 0.869 (0.001) | 0.297 (0.001) |
| Bag-of-words | Unigrams (notes only) | 0.832 (0.003) | 0.383 (0.004) | 0.849 (0.002) | 0.591 (0.004) | 0.880 (0.001) | 0.328 (0.002) |
| Bag-of-words | Unigrams (all features) | 0.880 (0.001) | 0.479 (0.008) | 0.842 (0.002) | 0.587 (0.002) | 0.880 (0.001) | 0.321 (0.001) |
| Bag-of-words | Unigrams and bigrams (all features) | 0.872 (0.002) | 0.460 (0.005) | 0.829 (0.001) | 0.585 (0.003) | 0.878 (0.001) | 0.315 (0.001) |
| Hierarchical (without pretraining) | Notes only | 0.825 (0.003) | 0.351 (0.003) | 0.850 (0.001) | 0.606 (0.003) | 0.887 (0.002) | 0.345 (0.005) |
| Hierarchical (without pretraining) | All features | 0.876 (0.003) | 0.471 (0.006) | 0.812 (0.014) | 0.555 (0.020) | 0.869 (0.004) | 0.291 (0.010) |
| SHiP | Notes only | 0.825 (0.004) | 0.353 (0.005) | 0.897* (0.003) | 0.667* (0.006) | 0.891 (0.001) | 0.352 (0.001) |
| SHiP | All features | 0.882 (0.001) | 0.479 (0.007) | 0.887 (0.003) | 0.660 (0.004) | 0.889 (0.002) | 0.332 (0.016) |

\* Statistically significant difference compared to the best bag-of-words model (two-tailed Welch’s t-test).
Table 2 compares the performance of all model variants on the selected prediction tasks. The SHiP models significantly improved over the BOW baselines on the two diagnosis tasks (under Welch’s t-test): for CCS prediction, the best SHiP model improved top-5 recall by 4.8 percentage points and F1 by 7.6 percentage points over the best BOW model; for ICD-9 prediction, weighted area under the ROC curve (AUROC) increased by 1.1 percentage points and area under the precision-recall curve (AUPRC) increased by 2.4 percentage points. For mortality prediction, we saw a negligible benefit from the SHiP architecture, with only a 0.2 percentage point increase in AUROC and no change in AUPRC.
The SHiP models also improved over the corresponding hierarchical attention networks without pretraining. For mortality, pretraining the all-features model increased AUROC by 0.6 percentage points and AUPRC by 0.8 percentage points; for primary CCS, pretraining the notes-only model increased top-5 recall by 4.7 percentage points and F1 by 6.1 percentage points; for all ICD-9, pretraining the notes-only model increased weighted AUROC by 0.4 percentage points and AUPRC by 0.7 percentage points.
4.3 Qualitative Analysis
Figure 3 shows examples of path-integrated gradients attribution from a primary CCS model, over discharge summaries from different patients. We observe that the SHiP model frequently concentrates on just one or a few important phrases, even in very long notes. The choice of phrase is often informed by the nearby context: for example, we can see that the SHiP model is consistently most sensitive to the clinically-relevant words following the phrase “discharge diagnoses.” In fact, in the first and second notes, the patient’s diagnosis is restated elsewhere in the text in a less relevant context (e.g. stating that the patient has “no family history” of diabetes), but the model is sensitive only to the instance where the discharge context is made explicit. The bag-of-words model, by contrast, is incapable of making such contextual distinctions, and is generally more sensitive to key words and phrases throughout the text. We attempted a similar analysis using standard gradients attribution, but found the outputs to be noisier and less interpretable for both models.
5 Discussion and Related Work
The results of this study demonstrate that SHiP, a novel combination of hierarchical modeling of clinical notes with language model pretraining, can improve discharge diagnosis classification over previous state-of-the-art models, with only minimal preprocessing of text. SHiP models process clinical notes in a way that is more sensitive to the context and structure of language compared to other common approaches, which often reduce notes to a set of keywords. Hierarchical recurrent networks (Yang et al., 2016; Chung et al., 2016; Hwang and Sung, 2017; Yu et al., 2016; Meng et al., 2017) and pretraining methods (Dai and Le, 2015; Devlin et al., 2018; Yang et al., 2019) have each individually proven successful in a wide variety of general NLP applications; here, we show the utility of these methods applied jointly, and specifically within a clinical context.
This work builds on recent literature on applying deep learning techniques to analysis of electronic health records data (Shickel et al., 2017). Many of the previous advances in deep learning for clinical NLP have relied on more standard convolutional or recurrent architectures rather than hierarchical approaches (Jagannatha and Yu, 2016; Vani et al., 2017; Mullenbach et al., 2018; Gehrmann et al., 2018; Chokwijitkul et al., 2018). However, a number of studies have recently begun applying hierarchical models to clinical text (Gao et al., 2017; Baumel et al., 2017; Samonte et al., 2018; Liu et al., 2018; Newman-Griffis and Zirikly, 2018) as well as other data modalities (Sha and Wang, 2017; Phan et al., 2018). These existing hierarchical text models do not utilize language model pretraining, generally relying on more limited techniques such as pretraining to approximate a weighted bag of word embeddings (Gao et al., 2017) or using pretrained word embeddings only (Liu et al., 2018; Newman-Griffis and Zirikly, 2018). Our results show that language model pretraining can substantially improve hierarchical models, and may in some cases be required to outperform strong bag-of-words baselines. This pretraining method may allow the hierarchical model to better learn long-term dependencies between words in a note and better use contextual information. The sequential processing of the note and path-integrated gradients also allow improved visualization of the parts of each note most relevant to a particular prediction. Future research might explore how this hierarchical pretraining framework could be extended to other features with sequential structure, such as vital signs.
This study also touches upon a broader question of when notes provide additional predictive value, relative to other parts of the medical record. Previous studies have found mixed evidence: for example, TF-IDF weighted unigrams extracted from notes were shown to have only moderate discriminative utility (and less than other elements of the clinical record) for predicting readmission (Walsh and Hripcsak, 2014); while applying an LSTM to notes represented as restricted bags of words was shown to improve performance on other tasks such as predicting diagnosis or length of stay (Boag et al., 2018). Our experiments also suggest a task-dependent pattern in the predictive value of notes. For all-cause mortality risk, we found that notes provided less predictive value compared especially to quantitative signals like labs or vitals. On the other hand, the SHiP models delivered clear improvements on the diagnosis classification tasks, likely because notes often contain rich diagnostic information – for example, discussions of differential diagnosis with qualifiers that cannot be easily captured in other forms. The difference was more pronounced for the primary CCS task; while ICD-9 prediction also benefited, the task of predicting several (possibly noisy) labels per patient is likely harder.
Our study has some important limitations. First, our analysis was limited to a single ICU patient population, and should be validated using other patient cohorts from other health centers. However, we note that nothing about our modeling approach is site-specific, and indeed its design should directly accommodate the particular note-writing habits at any institution. Second, although our proposed architecture can jointly model multiple data modalities, we found that our diagnosis models performed best with notes alone, and were prone to overfitting when provided additional features. Similarly, we hypothesized that adding bigrams to the BOW models might offer a simple approach to increasing the linguistic context available to the model, but found that this harmed generalization across tasks. Despite employing several common regularization techniques, we consistently observed a wider generalization gap (between training and validation set performance) when training the diagnosis models using more features. We hypothesize that it may be difficult to avoid overfitting to extraneous features and that our method may be better suited to larger patient cohorts, but this phenomenon invites further investigation. Third, our experiments included only a small subset of possible tasks; future research on a wider range of tasks might advance understanding of when and why notes are useful for prediction. Fourth, our experimental setup tests prediction at a single time point, when in fact in deployment the model will likely need to be adapted for continuous prediction. This remains an open area of research.
We demonstrate the effectiveness of techniques from deep NLP, particularly language model pretraining and hierarchical attention networks, for improved modeling of clinical notes. Our work provides a flexible and general approach that can be readily applied to clinical text from any source, for any modeling task where the unstructured information in the text is critical to understanding the outcome of interest.
We thank Nissan Hajaj and Xiaobing Liu for developing the core framework used to implement our models. We thank Gerardo Flores, Kathryn Rough, and Kun Zhang for providing assistance with our data processing and evaluation pipelines. We thank Kai Chen, Michael Howell, and Denny Zhou for their comments and feedback on this manuscript.
- Taggart et al.  Maxwell Taggart, Wendy W Chapman, Benjamin A Steinberg, Shane Ruckel, Arianna Pregenzer-Wenzler, Yishuai Du, Jeffrey Ferraro, Brian T Bucher, Donald M Lloyd-Jones, Matthew T Rondina, and Rashmee U Shah. Comparison of 2 natural language processing methods for identification of bleeding among critically ill patients. JAMA Netw Open, 1(6):e183451–e183451, October 2018.
- Pencina et al.  Michael J Pencina, Benjamin A Goldstein, Ann Marie Navar, and John P A Ioannidis. Opportunities and challenges in developing risk prediction models with electronic health records data: a systematic review. Journal of the American Medical Informatics Association, 24(1):198–208, 05 2016. ISSN 1067-5027. doi: 10.1093/jamia/ocw042. URL https://doi.org/10.1093/jamia/ocw042.
- Lipton et al.  Zachary C. Lipton, David C. Kale, Charles Elkan, and Randall Wetzel. Learning to Diagnose with LSTM Recurrent Neural Networks. arXiv e-prints, art. arXiv:1511.03677, Nov 2015.
- Choi et al.  Edward Choi, Mohammad Taha Bahadori, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Doctor AI: Predicting clinical events via recurrent neural networks. In Machine Learning for Healthcare Conference, pages 301–318, December 2016.
- Esteban et al.  Cristobal Esteban, Oliver Staeck, Stephan Baier, Yinchong Yang, and Volker Tresp. Predicting clinical events by combining static and dynamic information using recurrent neural networks. In 2016 IEEE International Conference on Healthcare Informatics (ICHI), 2016.
- Nickerson et al.  Paul Nickerson, Patrick Tighe, Benjamin Shickel, and Parisa Rashidi. Deep neural network architectures for forecasting analgesic response. Conf. Proc. IEEE Eng. Med. Biol. Soc., 2016:2966–2969, August 2016.
- Pham et al.  Trang Pham, Truyen Tran, Dinh Phung, and Svetha Venkatesh. DeepCare: A deep dynamic memory model for predictive medicine. In Advances in Knowledge Discovery and Data Mining, pages 30–41. Springer, Cham, April 2016.
- Choi et al.  Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. Using recurrent neural network models for early detection of heart failure onset. J. Am. Med. Inform. Assoc., 24(2):361–370, March 2017.
- Che et al.  Zhengping Che, Sanjay Purushotham, Kyunghyun Cho, David Sontag, and Yan Liu. Recurrent neural networks for multivariate time series with missing values. Sci. Rep., 8(1), 2018.
- Cheng et al.  Yu Cheng, Fei Wang, Ping Zhang, and Jianying Hu. Risk prediction with electronic health records: A deep learning approach. In Proceedings of the 2016 SIAM International Conference on Data Mining, 2016.
- Nguyen et al.  P. Nguyen, T. Tran, N. Wickramasinghe, and S. Venkatesh. Deepr: A convolutional net for medical records. IEEE Journal of Biomedical and Health Informatics, 21(1):22–30, Jan 2017. ISSN 2168-2194. doi: 10.1109/JBHI.2016.2633963.
- Marafino et al.  Ben J Marafino, Miran Park, Jason M Davies, Robert Thombley, Harold S Luft, David C Sing, Dhruv S Kazi, Colette DeJong, W John Boscardin, Mitzi L Dean, and R Adams Dudley. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw Open, 1(8):e185097–e185097, December 2018.
- Jacobson and Dalianis  Olof Jacobson and Hercules Dalianis. Applying deep learning on electronic health records in swedish to predict healthcare-associated infections. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing, 2016.
- Rajkomar et al.  Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Yi Zhang, Gerardo Flores, Gavin E Duggan, Jamie Irvine, Quoc Le, Kurt Litsch, Alexander Mossin, Justin Tansuwan, De Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L Volchenboum, Katherine Chou, Michael Pearson, Srinivasan Madabushi, Nigam H Shah, Atul J Butte, Michael D Howell, Claire Cui, Greg S Corrado, and Jeffrey Dean. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1(1):18, May 2018.
- Miotto et al.  Riccardo Miotto, Li Li, Brian A Kidd, and Joel T Dudley. Deep patient: An unsupervised representation to predict the future of patients from the electronic health records. Sci. Rep., 6:26094, May 2016.
- Suresh et al.  Harini Suresh, Nathan Hunt, Alistair Johnson, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Clinical intervention prediction and understanding with deep neural networks. In Machine Learning for Healthcare Conference, pages 322–337, November 2017.
- Johnson et al.  Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
- Devlin et al.  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv e-prints, art. arXiv:1810.04805, Oct 2018.
- Yang et al.  Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
- Dai and Le  Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Neural Information Processing Systems (NIPS), November 2015.
- Sundararajan et al.  Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML’17, pages 3319–3328. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305890.3306024.
- Johnson et al.  Alistair E W Johnson, Tom J Pollard, Lu Shen, Li-Wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. MIMIC-III, a freely accessible critical care database. Sci Data, 3:160035, May 2016.
- Pollard and Johnson  T J Pollard and A E W Johnson. The MIMIC-III clinical database. http://dx.doi.org/10.13026/C2XW26, 2016. Accessed: 2018-12-10.
- Elixhauser  Anne Elixhauser. Clinical Classifications for Health Policy Research, Version 2: Hospital Inpatient Statistics. 1996.
- Slee  Vergil N Slee. The international classification of diseases: Ninth revision (ICD-9). Ann. Intern. Med., 88(3):424, 1978.
- Hochreiter and Schmidhuber  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, December 1997.
- Kingma and Ba  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
- Srivastava et al.  Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15:1929–1958, 2014.
- Gal and Ghahramani  Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
- Krueger et al.  David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Aaron Courville, and Chris Pal. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations. arXiv e-prints, art. arXiv:1606.01305, Jun 2016.
- Desautels et al.  Thomas Desautels, Andreas Krause, and Joel W. Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. Journal of Machine Learning Research, 15:4053–4103, 2014. URL http://jmlr.org/papers/v15/desautels14a.html.
- Abadi et al.  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv e-prints, art. arXiv:1603.04467, Mar 2016.
- Pedregosa et al.  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in python. J. Mach. Learn. Res., 12(Oct):2825–2830, 2011.
- Chung et al.  Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical Multiscale Recurrent Neural Networks. arXiv e-prints, art. arXiv:1609.01704, Sep 2016.
- Hwang and Sung  K. Hwang and W. Sung. Character-level language modeling with hierarchical recurrent neural networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5720–5724, March 2017. doi: 10.1109/ICASSP.2017.7953252.
- Yu et al.  Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video paragraph captioning using hierarchical recurrent neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Meng et al.  Zhao Meng, Lili Mou, and Zhi Jin. Hierarchical RNN with static Sentence-Level attention for Text-Based speaker change detection. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management - CIKM ’17, 2017.
- Yang et al.  Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv e-prints, art. arXiv:1906.08237, Jun 2019.
- Shickel et al.  Benjamin Shickel, Patrick James Tighe, Azra Bihorac, and Parisa Rashidi. Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE Journal of Biomedical and Health Informatics, pages 1–1, 2017.
- Jagannatha and Yu  Abhyuday Jagannatha and Hong Yu. Structured prediction models for RNN based sequence labeling in clinical text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016.
- Vani et al.  Ankit Vani, Yacine Jernite, and David Sontag. Grounded Recurrent Neural Networks. arXiv e-prints, art. arXiv:1705.08557, May 2017.
- Mullenbach et al.  James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. Explainable prediction of medical codes from clinical text. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1101–1111, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1100. URL https://www.aclweb.org/anthology/N18-1100.
- Gehrmann et al.  Sebastian Gehrmann, Franck Dernoncourt, Yeran Li, Eric T. Carlson, Joy T. Wu, Jonathan Welt, John Foote, Jr., Edward T. Moseley, David W. Grant, Patrick D. Tyler, and Leo A. Celi. Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLOS ONE, 13(2):1–19, 02 2018. doi: 10.1371/journal.pone.0192360. URL https://doi.org/10.1371/journal.pone.0192360.
- Chokwijitkul et al.  Thanat Chokwijitkul, Anthony Nguyen, Hamed Hassanzadeh, and Siegfried Perez. Identifying risk factors for heart disease in electronic medical records: A deep learning approach. Proceedings of the BioNLP 2018 workshop, pages 18–27, 2018.
- Gao et al.  Shang Gao, Michael T Young, John X Qiu, Hong-Jun Yoon, James B Christian, Paul A Fearn, Georgia D Tourassi, and Arvind Ramanathan. Hierarchical attention networks for information extraction from cancer pathology reports. J. Am. Med. Inform. Assoc., November 2017.
- Baumel et al.  Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noémie Elhadad. Multi-Label Classification of Patient Notes: A Case Study on ICD Code Assignment. arXiv e-prints, art. arXiv:1709.09587, Sep 2017.
- Samonte et al.  Mary Jane C Samonte, Bobby D Gerardo, Arnel C Fajardo, and Ruji P Medina. ICD-9 tagging of clinical notes using topical word embedding. In Proceedings of the 2018 International Conference on Internet and e-Business - ICIEB ’18, 2018.
- Liu et al.  Jingshu Liu, Zachariah Zhang, and Narges Razavian. Deep EHR: Chronic Disease Prediction Using Medical Notes. arXiv e-prints, art. arXiv:1808.04928, Aug 2018.
- Newman-Griffis and Zirikly  Denis Newman-Griffis and Ayah Zirikly. Embedding transfer for low-resource medical named entity recognition: A case study on patient mobility. Proceedings of the BioNLP 2018 workshop, pages 1–11, 2018.
- Sha and Wang  Ying Sha and May D Wang. Interpretable predictions of clinical outcomes with an attention-based recurrent neural network. In Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics - ACM-BCB ’17, 2017.
- Phan et al.  Huy Phan, Fernando Andreotti, Navin Cooray, Oliver Y. Chén, and Maarten De Vos. SeqSleepNet: End-to-End Hierarchical Recurrent Neural Network for Sequence-to-Sequence Automatic Sleep Staging. arXiv e-prints, art. arXiv:1809.10932, Sep 2018.
- Walsh and Hripcsak  Colin Walsh and George Hripcsak. The effects of data sources, cohort selection, and outcome definition on a predictive model of risk of thirty-day hospital readmissions. J. Biomed. Inform., 52:418–426, December 2014.
- Boag et al.  Willie Boag, Dustin Doss, Tristan Naumann, and Peter Szolovits. What’s in a note? unpacking predictive value in clinical note representations. AMIA Summits Transl. Sci. Proc., 2017:26, 2018.
| Pretraining scheme | AUROC | AUPRC |
|---|---|---|
| Pretrained to 24 hours | 0.881 (0.001) | 0.478 (0.005) |
| Pretrained to discharge | 0.882 (0.001) | 0.479 (0.007) |

[Main results table: only the column headers survived extraction — Mortality (AUROC, AUPRC), Primary CCS (Top-5 Recall, F1), All ICD-9 (Weighted AUROC, AUPRC); the result rows were not recovered.]
| Hyperparameter | Mortality | Primary CCS | All ICD-9 |
|---|---|---|---|
| Record LSTM hidden units | 379 | 518 | 518 |
| Variational input dropout | 0.034 | 0.071 | 0.071 |
| Variational hidden dropout | 0.090 | 0.122 | 0.122 |
| Variational input dropout | 0.176 | 0.291 | 0.156 |
| Variational hidden dropout | 0.061 | 0.085 | 0.103 |

[Two further rows could not be unambiguously aligned to the three task columns: Gradient clip norm (37.5, 37.5, 0.125, 0.125, 0.125, 0.125) and Variational vocabulary dropout* (0.001, 0.229, 0.273, 0.396, 0.273, 0.273).]
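The variational dropout rates above follow Gal and Ghahramani (2016), where a single dropout mask is sampled per sequence and reused at every timestep, rather than resampled per step as in standard dropout. A minimal NumPy sketch of that masking scheme (illustrative only; the function names, shapes, and the use of a "Primary CCS" rate are our own, not the paper's implementation):

```python
import numpy as np

def variational_dropout_mask(shape, rate, rng):
    """Sample one Bernoulli keep-mask, scaled so the expected activation is unchanged."""
    keep = 1.0 - rate
    return rng.binomial(1, keep, size=shape) / keep

def apply_variational_dropout(x, rate, rng):
    """x: (batch, time, features). One mask per sequence, shared across all timesteps."""
    batch, _, features = x.shape
    mask = variational_dropout_mask((batch, 1, features), rate, rng)
    return x * mask  # broadcasting applies the same mask at every timestep

rng = np.random.default_rng(0)
x = np.ones((2, 5, 4))
y = apply_variational_dropout(x, 0.291, rng)  # input-dropout rate from the table
# every timestep of a given sequence shares the same mask:
assert np.allclose(y[:, 0, :], y[:, 3, :])
```

In a framework implementation this per-sequence mask sharing is what distinguishes the "variational input dropout" and "variational hidden dropout" rows above from ordinary dropout applied independently at each step.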