Clinical notes contain information about patients that goes beyond structured data like lab values and medications. However, clinical notes have been underused relative to structured data, because notes are high-dimensional and sparse. This work develops and evaluates representations of clinical notes using bidirectional transformers (ClinicalBert). ClinicalBert uncovers high-quality relationships between medical concepts as judged by humans. ClinicalBert outperforms baselines on 30-day hospital readmission prediction using both discharge summaries and the first few days of notes in the intensive care unit. Code and model parameters are available.
An electronic health record (ehr) stores patient information; it can save money, time, and lives (Pedersen et al., 2017). Every day, more data gets added to an ehr, so analyses may benefit from machine learning. Machine learning techniques leverage structured features in ehr data, such as lab results and electrocardiography measurements, to uncover patterns and improve predictions (Shickel et al., 2018; Xiao et al., 2018a; Yu et al., 2018). However, unstructured, high-dimensional, and sparse information such as clinical notes is difficult to use in clinical machine learning models. Our goal is to create a framework for modeling clinical notes that can uncover clinical insights and make medical predictions.
Clinical notes contain significant clinical value (Boag et al., 2018; Weng et al., 2017; Liu et al., 2018; Afzal et al., 2018). A patient might be associated with hundreds of notes within a stay and over their history of admissions. Compared to structured features, clinical notes provide a richer picture of the patient since they describe symptoms, reasons for diagnoses, radiology results, daily activities, and patient history. Consider clinicians working in the intensive care unit, who need to make decisions under time constraints. Making accurate clinical predictions may require reading a large volume of clinical notes. This can add to a doctor’s workload—tools that can make accurate predictions based on clinical notes might be useful in practice.
One estimate puts the financial burden of readmission at 17.9 billion dollars and the fraction of avoidable admissions at 76% (Basu Roy et al., 2015). Accurately predicting readmission has clinical significance, both for efficiency and for reducing the burden on intensive care unit doctors.
We develop a discharge support model, ClinicalBert (cb), that processes a patient’s notes and dynamically assigns a risk score of whether the patient will be readmitted within 30 days (Figure 1). As physicians and nurses write notes about a patient, cb processes the notes and updates the associated risk score of readmission. This score can help care providers make informed decisions and intervene in advance if needed. cb is also readily adapted to other tasks such as diagnosis predictions, mortality risk estimation, or length of stay assessments.
Clinical notes use abbreviations and jargon and have an unusual grammatical structure. Building models that learn useful representations of clinical text is a challenge.
Sager et al. (1995) frame representation learning for clinical notes as machine translation, translating unstructured text to representative sets of words. The bag-of-words model can be used for tasks dependent on individual words (Zhang et al., 2010). Log-bilinear word embedding models such as word2vec have also been used for learning representations of clinical notes (Mikolov et al., 2013; Pennington et al., 2014). Boag et al. (2018) study the performance of the bag-of-words model, word2vec, and an lstm model combined with word2vec on various tasks such as diagnosis prediction and mortality risk estimation. Word embedding models such as word2vec are trained using the local context of individual words, but clinical notes are long and their words are interdependent (Zhang et al., 2018), so these methods cannot capture long-range dependencies.
Natural language processing methods where representations include global, long-range information can yield a boost in performance on various tasks (Peters et al., 2018; Radford, 2018; Devlin et al., 2018). Clinical notes require capturing interactions between distant words. The need to model this long-range structure makes clinical notes suitable for contextual representations like those in the bert model (Devlin et al., 2018). We develop cb by applying bert to clinical notes. Concurrent with our work, Lee et al. (2019) apply bert to biomedical literature and Alsentzer et al. (2019) apply bert to clinical notes and discharge summaries.
Methods to evaluate models of clinical notes are also relevant to cb. Wang et al. (2018) and Chiu et al. evaluate the quality of biomedical embeddings by computing correlations between doctor-rated relatedness and embedding similarity scores. They also evaluate models through performance on downstream tasks such as information extraction. We adopt similar evaluation techniques in our work.
A good representation of clinical text should yield good performance on downstream tasks. We use 30-day hospital readmission prediction as a case study since it is of clinical importance. We refer readers to Futoma et al. (2015) for comparisons of traditional machine learning methods such as random forests and neural networks on hospital readmission tasks. The majority of work in this area has focused on integrating every possible covariate about a patient into a model. Xiao et al. (2018b) use topic models combined with recurrent neural networks for interpretability and learn clinical concept embeddings for a readmission task. Caruana et al. (2015) develop an interpretable model for readmission prediction based on generalized additive models and highlight the need for intelligible clinical predictions. Rajkomar et al. (2018) predict readmission using Fast Healthcare Interoperability Resources codes from notes, alongside structured information. Most of this previous work uses information at discharge. In this work, we develop a model that can predict readmission dynamically.
cb shows improved readmission prediction over methods that center on discharge summaries. Making a prediction using a discharge summary at the end of a stay means that there are fewer opportunities to reduce the chance of readmission. To build a clinically-relevant model, we define a task for predicting readmission at any timepoint since a patient was admitted.
To evaluate models on the readmission prediction task, we define a metric motivated by a clinical challenge. Medicine suffers from alarm fatigue (Sendelbach and Funk, 2013). This means useful classification rules for medicine need to have high positive predictive value, or precision. We evaluate model performance at a fixed positive predictive value. We show that cb has the highest recall compared to other popular methods for representing clinical notes.
cb can be readily applied to other tasks such as mortality prediction and disease prediction. In addition, the self-attention weights output by cb can be traced back to understand which elements of clinical notes were relevant to the current prediction. This can be used as an interpretability tool for clinicians.
We apply the bert model (Devlin et al., 2018) to clinical notes. Clinical notes are lengthy and numerous, and the computationally-efficient architecture of bert can model long-term dependencies. Compared to a popular model of clinical text, word2vec, cb more accurately captures clinical word similarity. We describe one way to scale up cb to handle large collections of clinical notes for clinical prediction tasks. In a case study of hospital readmission prediction, cb outperforms a deep language model. We open source cb pre-training and readmission model parameters, along with scripts to reproduce results.
cb learns deep representations of clinical text. These deep representations can be used to uncover clinical insights, such as predictions of disease, relationships between treatments and outcomes, or summaries of large volumes of text. cb is an application of the bert model (Devlin et al., 2018) to clinical texts; this requires several modifications to address the challenges intrinsic to clinical texts. Specifically, the representations are learned using medical notes and further processed for downstream clinical tasks. As an example, we use the clinical task of hospital readmission prediction.
bert is a deep neural network that uses the transformer encoder architecture (Vaswani et al., 2017) to learn embeddings for text. We omit a detailed description of the architecture; it is described in full in Vaswani et al. (2017). The transformer encoder architecture is based on a self-attention mechanism, and the pre-training objective function for the model is defined using two unsupervised tasks: masked language modeling and next sentence prediction. The text embeddings and model parameters are fit using stochastic optimization. For downstream tasks, the fine-tuning phase is problem-specific; we describe a fine-tuning task specific to clinical text.
A clinical note input to cb is represented as a collection of tokens. These tokens are subword units extracted from text in a preprocessing step (Sennrich et al., 2016). In cb, the representation of a token in a clinical note is computed as the sum of the token embedding, a learned segment embedding, and a position embedding. When multiple sequences of tokens are fed to cb, the segment embedding identifies which sequence a token is associated with. The position embedding of a token is a learned set of parameters corresponding to the token’s position in the input sequence (position embeddings are shared across tokens). A classification token [CLS] is inserted in front of every sequence of input tokens, and is used in downstream classification tasks.
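The input representation described above can be illustrated with a small sketch. It is not the released implementation: the table sizes, the dimensionality of 4, the token ids, and the helper names `make_table` and `embed_inputs` are all assumptions chosen to keep the example tiny, and the random tables stand in for parameters that would be learned.

```python
import random

def make_table(num_rows, dim, seed):
    """Build a small random embedding table (a stand-in for learned parameters)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_rows)]

def embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table):
    """Return one vector per token: token + segment + position embedding."""
    outputs = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vec = [
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[position])
        ]
        outputs.append(vec)
    return outputs

DIM = 4  # illustrative only; far smaller than a real model
token_table = make_table(10, DIM, seed=0)     # hypothetical 10-entry vocabulary
segment_table = make_table(2, DIM, seed=1)    # two segments (first / second sequence)
position_table = make_table(8, DIM, seed=2)   # positions 0..7

CLS_ID = 0                                    # hypothetical id for the [CLS] token
token_ids = [CLS_ID, 5, 7, 3]                 # [CLS] is placed in front of the sequence
segment_ids = [0, 0, 1, 1]
embedded = embed_inputs(token_ids, segment_ids, token_table, segment_table, position_table)
```

Each row of `embedded` is the vector the encoder would receive for the corresponding token.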
The attention function is computed on an input sequence, using the embeddings associated with the input tokens. The attention function takes as input a set of queries, keys, and values. To construct the queries, keys, and values, every input embedding is multiplied by learned sets of weights. For a single query, the output of the attention function is a weighted combination of values. The weight of a given value is determined by the interaction of the query and key. Denote a set of queries, keys, and values by $Q$, $K$, and $V$. The attention function is

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V, \tag{1}$$

where $d$ is the dimensionality of the queries, keys, and values. This function can be computed efficiently and can capture long-range interactions between any two elements in the input sequence (Vaswani et al., 2017). The length and complex patterns in clinical notes make the transformer architecture that uses this self-attention mechanism a good choice. (We describe how this self-attention mechanism leads to interpretability of clinical text in Section 4.)
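A minimal sketch of the attention function may help make the description concrete. It assumes the standard scaled dot-product form from Vaswani et al. (2017), softmax(QK^T / sqrt(d)) V, and uses plain Python lists instead of a tensor library; the function names and toy inputs are illustrative, not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d = len(keys[0])  # dimensionality of queries and keys
    outputs = []
    for q in queries:
        # Interaction of the query with every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted combination of the values.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
# A zero query interacts equally with every key, so the output is the mean of the values.
uniform_output = attention([[0.0, 0.0]], keys, values)
# A query aligned with the first key puts more weight on the first value.
focused_output = attention([[5.0, 0.0]], keys, values)
```

Because every query is compared with every key, the weight between two tokens does not depend on how far apart they are in the sequence.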
The quality of learned representations of text depends on the text the model was trained on. bert is trained on BooksCorpus and Wikipedia. However, these two datasets are distinct from clinical notes, where jargon and abbreviations are common and notes have different syntax and grammar than common language in books or encyclopedias. It is hard to understand clinical notes without professional training. cb is pre-trained on clinical notes as follows.
cb uses the same pre-training tasks as in Devlin et al. (2018). Masked language modeling consists of masking 15% of the input tokens and using the model to predict the masked tokens. In next sentence prediction, two sentences are fed to the model. The model outputs a binary prediction of whether these two sentences are in consecutive order. The pre-training objective function based on the two tasks is the sum of the log-likelihood of the masked tokens and the log-likelihood of the binary variable indicating whether two sentences are consecutive.
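The masking step of masked language modeling can be sketched as follows. This is a simplification under stated assumptions: it masks 15% of tokens (the rate used by Devlin et al. (2018)) and always substitutes a `[MASK]` token, whereas the full procedure in Devlin et al. (2018) sometimes keeps or randomly replaces a selected token. The function name and the example sentence are made up for illustration.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Mask a fraction of tokens; return the masked sequence and the prediction targets."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]  # the model is trained to predict this token
        masked[pos] = MASK_TOKEN
    return masked, targets

tokens = ("patient admitted with chest pain and shortness of breath "
          "history of atrial fibrillation on warfarin discharged home in stable condition").split()
masked, targets = mask_tokens(tokens)
```

The masked language modeling term of the objective is then the log-likelihood the model assigns to the original tokens stored in `targets`.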
After pre-training the model, cb is fine-tuned on a task specific to clinical data: readmission prediction. Let $\text{readmit}$ be a binary indicator of readmission of a patient within the next 30 days. Given clinical notes as input, the output of cb is used to predict the probability of readmission:

$$P(\text{readmit} = 1 \mid h_{[\text{CLS}]}) = \sigma(W h_{[\text{CLS}]}), \tag{2}$$

where $\sigma$ is the sigmoid function, $h_{[\text{CLS}]}$ is the output of the model associated with the classification token, and $W$ is a parameter matrix. The model parameters are fine-tuned to maximize the log-likelihood of this binary classifier.
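The readmission classifier can be sketched in a few lines. The sketch assumes the classifier is a sigmoid applied to a linear function of the classification-token output; the 3-dimensional vectors, the `weights` values, and the function names are hypothetical and only serve to show the computation.

```python
import math

def sigmoid(x):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def readmission_probability(h_cls, weights):
    """Probability of readmission from the classification-token output."""
    logit = sum(w * h for w, h in zip(weights, h_cls))
    return sigmoid(logit)

def log_likelihood(prob, label):
    """Log-likelihood of a binary readmit label under the predicted probability."""
    return math.log(prob) if label == 1 else math.log(1.0 - prob)

h_cls = [0.2, -0.4, 0.1]     # hypothetical output for the classification token
weights = [1.5, -0.5, 2.0]   # hypothetical classifier parameters
prob = readmission_probability(h_cls, weights)
objective = log_likelihood(prob, label=1)  # fine-tuning maximizes the sum of such terms
```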
We developed cb, a model of clinical notes whose representations can be used for clinical tasks. Before evaluating its performance as a model of readmission in Section 4, we study its performance in two experiments. First, we find that cb outperforms the original bert model in language modeling tasks. In the second experiment, we compare cb to popular word embedding models using a clinical word similarity task. The relationships between medical concepts learned by cb exhibit better correlation with human evaluation of similarity. Code for training the cb model is available at https://github.com/kexinhuang12345/clinicalBERT and model parameter checkpoints are at http://bit.ly/clinicalbert_weights.
We use the public mimic dataset (Johnson et al., 2016). mimic consists of the electronic health records of 58,976 unique hospital admissions from 38,597 patients in the intensive care unit of the Beth Israel Deaconess Medical Center between 2001 and 2012. There are 2,083,180 de-identified notes associated with the admissions. Preprocessing of the clinical notes is described in Appendix B. For pre-training cb, we randomly sample 100,000 notes from mimic.
Table 1: Accuracy on mimic data. Columns: Model | Masked language modeling | Next sentence prediction.
The parameters are initialized to the bertbase parameters released by Devlin et al. (2018); we follow their recommended hyperparameter settings. The model dimensionality is 768. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of . The maximum sequence length supported by the model is set to 512, and the model is first trained using shorter sequences. (The details of constructing a sequence are in Devlin et al. (2018). For efficient minibatching that avoids padding minibatch elements of variable lengths with too many zeros, a corpus is split into multiple sequences of equal lengths. Many sentences are packed into a sequence until the maximum length is reached; a sequence may be composed of many sentences. The next sentence prediction task defined in Devlin et al. (2018) might more accurately be termed a next sequence prediction task.) The model is trained using a maximum sequence length of 128 for 75,000 iterations on the masked language modeling and next sentence prediction tasks, with a batch size of 32. Next, the model is trained on longer sequences of maximum length 512 for an additional 75,000 iterations with a batch size of 4. The experiments are conducted on Amazon Web Services using a single K80 GPU. The masked language modeling task is illustrated in Figure 7.
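The packing of sentences into sequences can be approximated by a greedy procedure. The sketch below is an assumption about how such packing might work, not the procedure of Devlin et al. (2018): it appends tokenized sentences to the current sequence until the next sentence would exceed the maximum length. The function name `pack_sentences` and the toy sentences are hypothetical.

```python
def pack_sentences(sentences, max_length):
    """Greedily pack tokenized sentences into sequences of at most max_length tokens."""
    sequences, current = [], []
    for sentence in sentences:
        sentence = sentence[:max_length]  # truncate a sentence that alone exceeds the limit
        if current and len(current) + len(sentence) > max_length:
            sequences.append(current)     # the current sequence is full; start a new one
            current = []
        current = current + sentence
    if current:
        sequences.append(current)
    return sequences

# Four toy "sentences" of 3, 4, 2, and 5 tokens, packed with a limit of 8 tokens.
sentences = [["a"] * 3, ["b"] * 4, ["c"] * 2, ["d"] * 5]
packed = pack_sentences(sentences, max_length=8)
lengths = [len(sequence) for sequence in packed]
```

With a limit of 128 or 512 tokens, each packed sequence would typically hold many sentences, which is why the task is closer to next sequence prediction.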
We report the accuracy of the masked language modeling and next sentence prediction tasks on the mimic data in Table 1. The performance of bert suffers as the model has not been trained on clinical text. This highlights the need to develop models tailored to clinical data such as cb.
We test cb on a dataset designed to assess medical term similarity (Pedersen et al., 2017). This dataset consists of 30 pairs of medical terms whose similarity is rated by physicians. Although our model is intended for sequences, we can obtain a feature-based word embedding by feeding the model a sequence that consists of individual tokens corresponding to a medical term. Devlin et al. (2018) conclude that the concatenation or sum of the last four hidden states of the encoders in bert has the best performance on downstream tasks compared to other combinations of hidden states. We use the sum of the last four hidden states of encoders as a representation of medical terms. Medical terms are of various lengths in the clinical concept dataset, so the hidden states for each subword unit are summed and divided by the number of subword units. This results in a fixed 768-dimensional vector for each medical term in the dataset. We visualize the similarity of medical terms using dimensionality reduction (van der Maaten and Hinton, 2008). The full plot is in Appendix A; we highlight a cluster of heart-related concepts in Figure 4. cb has learned a representation space that groups similar medical concepts. Heart-related concepts such as myocardial infarction, atrial fibrillation, and myocardium are close together; renal failure and kidney failure are also close. cb has captured some clinical semantics.
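The pooling used to turn hidden states into a single term embedding can be sketched as follows. The nested-list layout `hidden_states[layer][subword]`, the function name, and the 3-dimensional toy vectors are assumptions for illustration; in the model each vector would have 768 entries.

```python
def term_embedding(hidden_states):
    """Sum the last four layers per subword unit, then average over subword units."""
    last_four = hidden_states[-4:]
    num_subwords = len(last_four[0])
    dim = len(last_four[0][0])
    # Sum of the last four hidden states for each subword unit.
    summed_per_subword = [
        [sum(layer[i][j] for layer in last_four) for j in range(dim)]
        for i in range(num_subwords)
    ]
    # Average over subword units so terms of any length map to a fixed-size vector.
    return [sum(vec[j] for vec in summed_per_subword) / num_subwords for j in range(dim)]

# Toy input: 5 layers, a term split into 2 subword units, 3-dimensional states.
hidden_states = [[[float(layer + sub)] * 3 for sub in range(2)] for layer in range(5)]
embedding = term_embedding(hidden_states)
```

In this toy input the last four layers contribute 1+2+3+4 = 10 for the first subword unit and 2+3+4+5 = 14 for the second, so every entry of `embedding` is their average, 12.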
The Pearson correlation is computed between the cosine similarity of embeddings learned by models of clinical text, and physician ratings of the similarity of medical concepts in the Pedersen et al. (2017) dataset. These numbers are comparable to the best result (0.632) from Wang et al. (2018).
We benchmark embedding models using the clinical concept dataset in Pedersen et al. (2017). The dataset consists of 30 concept pairs. The similarity of a pair of concepts is rated by physicians, with a score ranging from 1.0 to 4.0 (least similar to most similar). To evaluate representations of clinical text, we calculate the similarity between two concepts’ embeddings $a$ and $b$ using cosine similarity,

$$\mathrm{Similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}.$$

We calculate the Pearson correlation between physician ratings of medical concept similarity reported in Pedersen et al. (2017) and the computed cosine similarity between model embeddings. A high correlation score means that a model’s embeddings capture human-rated similarity between clinical terms. Wang et al. (2018) conduct a similar evaluation on this data using word2vec word embeddings (Mikolov et al., 2013) trained on clinical notes, biomedical literature, and Google News. However, they use a private clinical note dataset from The Mayo Clinic to train the word2vec model. For a fair comparison with cb, we retrain the word2vec model using clinical notes from mimic. The word2vec model is trained on 2.8 billion words from mimic with the same hyperparameters as in Wang et al. (2018). word2vec cannot handle out-of-vocabulary words; we ignore the three medical pairs in the clinical concepts dataset that do not have embeddings (the correlation score is computed using the remaining 27 medical pairs). Because of this shortcoming, we also train a FastText model (Bojanowski et al., 2017) on mimic. FastText can compute embeddings for out-of-vocabulary words using subword embeddings (we use the same hyperparameters as in Wang et al. (2018)).
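The evaluation described above combines two standard computations, cosine similarity and Pearson correlation. The sketch below shows both in plain Python. The 2-dimensional embeddings and the three physician ratings are made-up placeholders, not values from the Pedersen et al. (2017) dataset.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical embeddings for three concept pairs and hypothetical ratings (1.0 to 4.0).
term_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
]
physician_ratings = [4.0, 2.5, 1.0]

similarities = [cosine_similarity(a, b) for a, b in term_pairs]
correlation = pearson(physician_ratings, similarities)
```

A correlation close to 1 would indicate that pairs rated as more similar by physicians also have more similar embeddings.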
Table 2 shows the correlations with physician judgments of cb and competing models. cb more accurately captures the relationships between clinical terms assessed from physician judgments.
The deep representations learned by cb can be used to build clinically-relevant models. We focus on the task of building a model for hospital readmission prediction using clinical notes. Compared to competitive models of language, cb accurately predicts readmission. cb also reveals interpretable patterns in medical data that can be used to understand its predictions.
We select a patient cohort from mimic (the full dataset is described in Section 3) using covariates associated with each patient. For the readmission prediction task, we compute the binary readmit label associated with each patient admission as follows. Patient admissions for which the patient is readmitted within 30 days receive a label of $\text{readmit} = 1$. All other patient admissions receive a label of zero. This includes patients with scheduled appointments within 30 days (since we are interested in unexpected readmission). We notice that 5,854 admissions are in-hospital deaths. Since death does not imply readmission, we remove these admissions. We also observe that there are 7,863 admissions where the patient is of type newborn. These are newborns in the neonatal intensive care unit, where most undergo testing and are sent back for routine care. This leads to a different distribution of clinical notes and readmission labels; we filter out newborns and focus on non-newborn readmissions. The final cohort contains 34,560 patients with 2,963 positive readmission labels and 42,358 negative labels.
Patients are typically associated with many notes. The cb model has a fixed maximum length of input sequence. We split notes into subsequences (each subsequence is of the maximum length supported by the model), and define how cb makes predictions on long sequences by binning the predictions on each subsequence. The probability of readmission for a patient is computed as follows. Assume the patient’s clinical notes are represented as $n$ subsequences and fed to the model separately; the model outputs a probability for each subsequence. The probability of readmission is computed using the probabilities output for each of these subsequences:

$$P(\text{readmit} = 1 \mid h_{\text{patient}}) = \frac{P^{n}_{\max} + P^{n}_{\text{mean}} \, n/c}{1 + n/c}, \tag{3}$$

where $c$ is a scaling factor that controls the amount of influence of the number of subsequences $n$, and $h_{\text{patient}}$ is the implicit representation cb computes from the entirety of a patient’s notes. $P^{n}_{\max}$ is the maximum of the probability of readmission across the $n$ subsequences, and $P^{n}_{\text{mean}}$ is the mean of the probability of readmission across the $n$ subsequences a patient’s notes have been split into.
We find that computing readmission probability using Equation 3 consistently outperforms predictions on each subsequence individually by 3–8%. Equation 3 is motivated by several observations. First, some subsequences (such as tokens corresponding to progress reports) do not contain information about readmission, whereas others do. The risk of readmission should be computed using subsequences that correlate with readmission risk, and the effect of unimportant subsequences should be minimized. This is accomplished by using the maximum probability over subsequences. Second, noisy subsequences mislead the model and decrease performance. For example, say there are 4 subsequences with a score lower than 0.3, and one noisy subsequence with a score of 0.8. Simply using the maximum will not account for cases where the maximum is the noise: this would lead to a false prediction. Therefore, we also include the average probability of readmission across subsequences. This leads to a trade-off between the mean and maximum probabilities of readmission in Equation 3. Finally, if there are a large number of subsequences for a patient with many clinical notes, there is a higher probability of having a noisy maximum probability of readmission. This means longer sequences may need to have a larger weight on the mean prediction. We include this weight as the scaling factor $n/c$, where $n$ is the number of subsequences and $c$ adjusts for patients with many clinical notes. The denominator normalizes the final risk score to lie in the unit interval. Empirically, we choose the value of $c$ that performs best on validation data.
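The trade-off between the maximum and the mean can be made concrete with a short sketch. It assumes the combination has the form (P_max + P_mean · n/c) / (1 + n/c), which matches the description of a maximum term, a mean term weighted by n/c, and a normalizing denominator. The default `c=2.0` is an illustrative value only, and the subsequence scores are made up to mirror the noisy-subsequence example in the text.

```python
def patient_readmission_probability(subsequence_probs, c=2.0):
    """Combine per-subsequence probabilities into one patient-level risk score.

    c is a scaling factor; 2.0 is an illustrative choice, not a reported setting.
    """
    n = len(subsequence_probs)
    p_max = max(subsequence_probs)
    p_mean = sum(subsequence_probs) / n
    # The mean term is weighted by n / c, so it matters more for patients with many notes.
    return (p_max + p_mean * n / c) / (1.0 + n / c)

# Four low-scoring subsequences and one noisy high-scoring subsequence.
probs = [0.1, 0.2, 0.2, 0.25, 0.8]
score = patient_readmission_probability(probs)
```

Here the maximum is 0.8 and the mean is 0.31, so with n/c = 2.5 the combined score is (0.8 + 0.775) / 3.5 = 0.45: the single noisy subsequence no longer drives the prediction on its own.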
For validation and testing, 10% of the data is held out for each, and 5-fold cross-validation is conducted. Each model is evaluated using three metrics:
Area under the receiver operating characteristic curve: the area under the plot of the true positive rate against the false positive rate at various thresholds.
Area under the precision-recall curve: the area under the plot of precision versus recall at various thresholds.
Recall at precision of 80%: For the readmission task, false positives are important. To minimize the number of false positives and thus minimize the risk of alarm fatigue, we set the precision to 80% (in other words, 20% of the predicted positive class are false positives) and use the corresponding threshold to calculate recall. This leads to a clinically-relevant metric that enables us to build models that control the false positive rate.
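The third metric can be computed with a simple threshold sweep. The sketch below is one reasonable reading of recall at a fixed precision, assuming distinct scores: it ranks examples by score, treats each rank as a candidate threshold, and reports the highest recall among thresholds whose precision is at least 80%. The function name and the toy labels and scores are hypothetical.

```python
def recall_at_precision(labels, scores, min_precision=0.8):
    """Highest recall over thresholds whose precision is at least min_precision."""
    total_positives = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best_recall = 0.0
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        # Predict positive for the k highest-scoring examples.
        true_positives += label
        precision = true_positives / k
        if precision >= min_precision:
            best_recall = max(best_recall, true_positives / total_positives)
    return best_recall

labels = [1, 1, 0, 1, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
rp80 = recall_at_precision(labels, scores)
```

In this toy example, predicting positive for the five highest scores yields 4 true positives out of 5 predictions (precision 80%) and recovers 4 of the 5 readmitted patients, so the recall at 80% precision is 0.8.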
We compose two baselines based on results from Boag et al. (2018). Boag et al. (2018) conclude that the bag-of-words model and an lstm model with word2vec embeddings are two strong baselines for predictive tasks on mimic clinical notes. We compare cb with these two methods.
The trainable parameters are the entire encoder network, along with the classifier parameter matrix $W$. Note that the data labels are imbalanced: negative labels are subsampled to balance the positive readmit labels. In every experiment, cb is trained for one epoch with batch size 4 and learning rate . The cb model settings are the same as in Section 3. The binary classifier is a linear layer of shape $768 \times 1$, illustrated in Figure 2.
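The subsampling of negative labels mentioned above could look like the following sketch. It assumes examples are `(note, label)` pairs and that negatives are sampled uniformly at random to match the number of positives; the function name, the seed, and the toy data are illustrative assumptions.

```python
import random

def balance_by_subsampling(examples, seed=0):
    """Subsample negative examples so both classes have the same number of examples."""
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    sampled_negatives = rng.sample(negatives, min(len(positives), len(negatives)))
    balanced = positives + sampled_negatives
    rng.shuffle(balanced)  # mix the classes before building minibatches
    return balanced

# Toy data: 3 positive and 17 negative examples (notes are placeholder strings).
examples = [(f"note {i}", 1) for i in range(3)] + [(f"note {i}", 0) for i in range(3, 20)]
balanced = balance_by_subsampling(examples)
```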
The bag-of-words model is a simple method that uses word counts to represent a note. We pick the top 5,000 tf-idf words as features. This means each note is represented by a 5,000-dimensional vector where each entry is the count of the corresponding vocabulary word occurring in the note. The top 5,000 tf-idf words are computed using the training set. The bag-of-words representation is then computed for every note in the training and test sets. Logistic regression with L2 regularization is used to fit the training readmission labels.
Although the bag-of-words method is simple and fast, it does not consider the temporal relationship between words in the note. A bilstm (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005) is used to build a deep model of relationships between words in a sequence. For the input word embedding, the word2vec model from Section 3 is used. The bilstm has 200 output units, with a dropout rate of 0.1. The hidden state is fed into a global max pooling layer and a fully-connected layer with a dimensionality of 50, followed by a rectifier activation function. The rectifier is followed by a fully-connected layer with a single output unit with a sigmoid activation function. The binary classification objective function is optimized using the Adam adaptive learning rate method (Kingma and Ba, 2015). The bilstm is trained for three epochs with a batch size of 64, with early stopping based on the validation loss.
The mean of 5-fold cross validation is reported along with the standard deviation. cb outperforms both the bag-of-words model and the bilstm deep language model.
The discharge summary contains essential information about a patient’s stay since it is used by the post-hospital care team and doctors in future visits (Van Walraven et al., 2002). The summary may contain information like the patient’s discharge condition, procedures, treatments, and significant findings (Kind and Smith, 2008). This means discharge summaries should have predictive value for hospital readmission. Table 3 shows that cb outperforms competitors in terms of precision and recall, on a task of readmission prediction based on patient discharge summaries.
Panels: (a) Area under receiver operating characteristic; (b) Area under precision-recall; (c) Recall at precision of 80%.
Discharge summaries have predictive power for readmission. However, discharge summaries might be written after a patient has left the hospital. Therefore, discharge summaries are not actionable since doctors cannot intervene when a patient has left the hospital. Models that dynamically predict readmission in the early stages of a patient’s admission are relevant to clinicians. For the second set of readmission prediction experiments, a maximum of the first 48 or 72 hours of a patient’s notes are concatenated. These concatenated notes are used to predict readmission. Since we separate notes into subsequences of the same length, the training set consists of all subsequences within a maximum of 72 hours, and the model is tested given only available notes within the first 48 or 72 hours of a patient’s admission. Note that readmission predictions from a model are not actionable if a patient has been discharged. For testing 48 or 72-hour clinical note readmission prediction, patients that are discharged within 48 or 72 hours (respectively) are filtered out.
The models in Section 4.2 are evaluated using the metrics in Section 4.3. Table 4 shows that cb outperforms the baselines in both experiments. The receiver operating characteristic and precision-recall results show that cb has more confidence and higher accuracy. At a fixed rate of false alarms, cb recalls more patients that have been readmitted. The accuracy of cb increases as the length of admissions increases and the model has access to more clinical notes.
Clinicians may mistrust data-driven methods because of a lack of interpretable predictions: predictions from a neural network are difficult to understand for humans, and it is not clear why a model made a certain prediction or what parts of input data were most informative. cb uses several self-attention mechanisms which can be used to inspect its predictions, by visualizing terms correlated with predictions of hospital readmission.
For every clinical note input to cb, each self-attention mechanism computes a distribution over every term in a sentence, given a query. For a given query vector $q$ computed from an input token, the attention weight distribution is defined as

$$\mathrm{AttentionWeight}(q, K) = \mathrm{softmax}\left(\frac{qK^\top}{\sqrt{d}}\right), \tag{4}$$

where $K$ is the matrix of keys and $d$ is the dimensionality of the queries and keys.
The attention weights are the weights used to compute the weighted sum of values in Equation 1. A high attention weight between a query and key token means the interaction between these tokens is predictive of readmission. In the cb encoder, there are 144 self-attention mechanisms (or, 12 multi-head attention mechanisms for each of the 12 transformer encoders). After training, each mechanism specializes to different patterns in clinical notes that are indicative of readmission.
To illustrate this, a sentence representative of a mimic note is input to cb. For this sentence, the queries are the tokens in the sentence, and the keys are the tokens in the same sentence. Selected attention mechanism distributions for every query are computed using Equation 4 and visualized in Figure 5. The left panel shows an attention mechanism that is activated for the words “chronic” and “acute” given any query term. This means some heads search for specific predictive terms, a computation similar to the bag-of-words model. Intuitively, the presence of the token associated with the word “chronic” is a predictor of readmission.
The attention mechanism visualized in panel (a) of Figure 5 shows attention weights that are high for keys within a certain window of a query token. This attention mechanism may focus on local information analogously to the local context window used in word2vec. The attention mechanism in panel (b) shows a pattern shifted below the diagonal: this means that the attention weight is high when the query and key terms are adjacent, which is reminiscent of a bigram model (trigram-type attention mechanisms have also been observed). The attention weights in (c) are less interpretable. Figure 6 visualizes attention weights for a long clinical note. The large off-diagonal attention weights show that cb captures correlation across long ranges of clinical text to make readmission predictions.
We developed cb, a model for learning deep representations of clinical text. Empirically, cb is an accurate language model and captures semantic relationships in clinical text as judged by physicians. In a 30-day hospital readmission prediction task, cb outperforms a deep language model and yields a 15% relative increase in recall at a fixed rate of false alarms. Future work includes engineering to scale cb to capture dependencies in very long clinical notes; the max and mean operations in Equation 3 may not capture correlations within long notes. The publicly-available cb model parameters can be used to evaluate performance on clinically-relevant prediction tasks based on clinical notes.
We thank Noémie Elhadad for helpful discussion. Grass by Milinda Courey from the Noun Project.
Honnibal and Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
Rajkomar et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine, 1(1):18, 2018.
cb requires minimal preprocessing. First, words are converted to lowercase, and line breaks and carriage returns are removed. Then de-identification brackets and special characters are removed.
The next sentence prediction pre-training task described in Devlin et al. (2018) requires two sentences at every iteration. The SpaCy sentence segmentation package is used to segment each note (Honnibal and Montani, 2017). Since clinical notes don’t follow rigid standard language grammar, we find rule-based segmentation has better results than dependency-parsing-based segmentation. Various segmentation signs that misguide rule-based segmenters are removed or replaced. Clinical notes can include various lab results and medications that also contain numerous rule-based separators. To address this, segments that have fewer than 20 words are fused into the previous segment so that they are not singled out as different sentences.
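The fusing of short segments can be sketched as a single pass over the segmenter output. The sketch assumes segments are plain strings, counts words by whitespace splitting, and keeps a short first segment as-is because there is nothing before it to fuse into; the function name and the toy segments are hypothetical.

```python
def fuse_short_segments(segments, min_words=20):
    """Merge any segment with fewer than min_words words into the previous segment."""
    fused = []
    for segment in segments:
        if fused and len(segment.split()) < min_words:
            fused[-1] = fused[-1] + " " + segment  # attach to the previous segment
        else:
            fused.append(segment)
    return fused

long_a = " ".join(["word"] * 25)   # placeholder for a 25-word sentence
long_b = " ".join(["token"] * 22)  # placeholder for a 22-word sentence
segments = [long_a, "k 4.1 .", "na 138 .", long_b]
fused = fuse_short_segments(segments)
```

Here the two short lab-value fragments are attached to the preceding sentence, leaving two segments instead of four.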