Clinical Predictive Keyboard using Statistical and Neural Language Modeling

06/22/2020 ∙ by John Pavlopoulos, et al. ∙ Stockholms universitet

A language model can be used to predict the next word during authoring, to correct spelling, or to accelerate writing (e.g., in SMS or email). Language models, however, have only been applied on a very small scale to assist physicians during authoring (e.g., of discharge summaries or radiology reports). Beyond assisting the physician, computer-based systems that expedite patient discharge also help decrease hospital-acquired infections. We employed statistical and neural language modeling to predict the next word of a clinical text, and we assess all models in terms of accuracy and keystroke discount on two datasets of radiology reports. We show that a neural language model can achieve accuracy as high as 51.3% (i.e., more than one out of two words predicted correctly). We also show that even when the models are employed only for frequent words, the physician can save valuable time.




I Introduction

Text prediction is a challenging problem in machine learning and natural language processing, and at the same time there is a growing need for novel techniques for efficient and accurate text prediction in several application domains, such as dictation and typing systems for people with disabilities or clinical text prediction for healthcare practitioners [6]. More concretely, with text prediction we refer to the task of predicting the next block of text in an online fashion, where a block can refer to different text granularity levels, e.g., sentences, words, syllables, or characters (keystrokes) [7].

The main focus of this paper is medical text, with the concrete task of predicting the next word given an incomplete text. We also refer to this problem as a predictive keyboard for medical text. When applied in the clinical setting (e.g., authoring of hospital discharge summaries or diagnostic text), physicians can benefit greatly from a fast and accurate predictive keyboard system, since it can assist them with (a) speedy compilation of the intended text, (b) prevention of potential text errors due to work overload, and (c) speedier patient discharge.

Initial efforts towards solving the predictive keyboard problem for radiology reports are described by Eng and Eisner 2004 [5], where a 3-Gram language model achieves an average keystroke reduction by a factor of 3.3. Following this line of research, we employed N-Gram-based statistical language modeling, which we refer to as N-GLM, to predict the next word of a clinical text. We vary N from 1 to 10 and show that 4-Gram models achieve 38% accuracy when predicting the next word in a clinical text, outperforming the other N-GLMs. Observe that accuracy in this case measures the fraction of times the next word was predicted correctly, hence inducing an equivalent typing speedup at the word level. We additionally investigated two neural language models: (1) a Recurrent Neural Network (RNN) language model based on Long Short-Term Memory, which we refer to as LSTMLM, and (2) a Gated Recurrent Unit (GRU) based language model, which we refer to as GRULM. These models achieve higher accuracy than 4-GLM, since our experimental evaluation demonstrates that accuracy can reach up to 51.3% (i.e., 5 out of 10 'next' words predicted correctly). An example of the output of this task is depicted in Table I, where we can observe the next word predictions made by LSTMLM and 4-GLM, with the correctly predicted words indicated in '[]'.

Next, we outline the related work in the area of clinical text prediction, followed by a summary of our contributions.

LSTMLM "the lungs are clear without [evidence] [of] focal infiltrate [or] [effusion] [there] [is] [no] [pneumothorax] [the] [visualized] [bony] [structures] [reveal] [no] [acute] [abnormalities]"
4-GLM "the lungs are [clear] without evidence [of] [focal] infiltrate or effusion [there] [is] [no] pneumothorax [the] visualized bony [structures] reveal [no] [acute] [abnormalities]"
TABLE I: Example use-case on IUXRay test words, using 4-GLM and LSTMLM. Words in [] were correctly predicted by each model.

I-A Related Work

The study of the benefits of computer-assisted text generation dates back more than two decades [13]. When applied to clinical notes, such as radiology reports, a statistical 3-Gram language model (with backoff) achieved substantial keystroke reductions [5]. Recently, an even simpler 3-Gram language model (i.e., without backoff) outperformed the earlier one, while also decreasing the typing time for the clinician by one third [22]. These results demonstrate that N-Gram models can provide promising solutions to our problem, and hence in this paper we provide a more extensive evaluation of these models on medical text. Besides computer-assisted typing, language models have also been used for spelling correction in clinical notes [17, 21]. This work does not focus on spelling correction, but these works verify that the words suggested by a language model during typing are also implicitly checked for correctness (i.e., assuming that the corpus contains correctly spelled words), hence the generated text will be of equal or even higher quality.

With the recent advances in deep learning, deep neural networks, such as Long Short-Term Memory (LSTM) models [8], have improved the performance of natural language processing (NLP) tasks in the biomedical field, such as Named Entity Recognition (NER) [14], medical code prediction [16], relation classification [15], and predicting hospital readmission [9]. Language modeling is also part of this advance, since it is often employed as a pre-training step [18]. For the task of next word prediction in a medical setting, however, neural language modeling is heavily underexplored. To our knowledge, the only application of a neural language model was that of a baseline LSTM network (applied on a private dataset), which was improved when structured information from electronic health data (e.g., gender or age) was integrated [19]. The authors reported 8% accuracy (a.k.a. Recall@1 or Precision@1) for the baseline LSTM, which ranks it much lower than competing statistical language models [15]. However, neural networks have been reported to outperform statistical language modeling in non-medical domains [3]. In this work, we compare statistical and neural language modeling, a comparison which has not been studied before in this setting, and we show that the neural approach outperforms the statistical approach in next word prediction by a large margin.

I-B Contributions

The main contributions of this paper can be summarized as follows: (1) We highlight the importance of the problem of next word prediction for clinical text, and demonstrate how language models can be employed to provide scalable solutions to this problem; (2) We provide an extensive benchmark on clinical text obtained from two real-world medical datasets by comparing the performance of the N-GLM model for different values of N in terms of accuracy and keystroke discount; (3) We additionally compare RNN language models based on LSTM and GRU on the same datasets and demonstrate their superiority over N-GLMs, as they can achieve an accuracy of up to 51.3%, indicating a speedup (at the word level) of the same degree, and a keystroke discount of up to 41.12%, indicating a speedup (at the character level) of the same degree.

II Methods

II-A Statistical Language Modeling

Statistical language models [10, 12] are based on the Markov assumption, modeling the probability of the next word given only the N-1 preceding words:

P(w_t | w_1, ..., w_{t-1}) ≈ P(w_t | w_{t-N+1}, ..., w_{t-1})   (1)

The counts of all sequences of N words (a.k.a. N-grams) are calculated over a corpus, and a probability distribution over the vocabulary is modeled for each gram of N-1 words. A 2-gram (a.k.a. bigram) model will then only consider the previous word w_{t-1} to predict the next word w_t, and w_t will be the one most frequently occurring in the corpus right after w_{t-1}. Probabilities are formed using maximum likelihood estimation, changing Eq. 1 to:

P(w_t | w_{t-N+1}, ..., w_{t-1}) = C(w_{t-N+1}, ..., w_t) / C(w_{t-N+1}, ..., w_{t-1})   (2)

where C(·) are the counts of the gram. To deal with unknown words, a pseudo token can be introduced (e.g., masking very rare words with '[oov]' during training). To deal with unseen sequences of words, one can introduce smoothing, or backoff and interpolation. In this work we employed Laplace smoothing, but we note that algorithms such as Kneser-Ney or Good-Turing should also be investigated. For more information regarding statistical language models we refer the interested reader to [10, 12].

II-B Neural Language Modeling

Neural language modeling makes it possible to consider long-range word dependencies without an explicitly predefined context length [20]. At each time step t, the neural language model learns a hidden state h_t as a non-linear combination (with learned weight matrices) of the input word x_t and the previous hidden state h_{t-1}. The vanishing gradient problem, arising from the deep-in-time back-propagation, is addressed with LSTM cells [8]. More formally:

i_t = σ(W_i x_t + U_i h_{t-1})
f_t = σ(W_f x_t + U_f h_{t-1})
o_t = σ(W_o x_t + U_o h_{t-1})
c̃_t = tanh(W_c x_t + U_c h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where i_t is the input gate and f_t is the forget gate, which regulate how much information from the current candidate cell (c̃_t) and the previous cell (c_{t-1}) is kept or forgotten, and o_t is the output gate, which regulates the information of the new hidden state h_t. The generation of the next word can then be seen as a classification task, with a softmax over h_t yielding a probability distribution over the whole vocabulary and the next word to be generated being the most probable one.
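A single LSTM time step, as formalized above, can be sketched with NumPy. This is a sketch only; stacking the four gate transforms into one weight matrix and adding a bias term are our own conventions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b stack the parameters of the input (i),
    forget (f), output (o) and candidate-cell transforms, in that order."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b           # all four pre-activations at once
    i = sigmoid(z[0 * d:1 * d])          # input gate
    f = sigmoid(z[1 * d:2 * d])          # forget gate
    o = sigmoid(z[2 * d:3 * d])          # output gate
    c_cand = np.tanh(z[3 * d:4 * d])     # candidate cell state
    c = f * c_prev + i * c_cand          # keep old and admit new information
    h = o * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h),
                 rng.normal(size=(4 * d_h, d_in)),
                 rng.normal(size=(4 * d_h, d_h)),
                 np.zeros(4 * d_h))
```

Since h is an output-gated tanh of the cell state, each of its components stays strictly inside (-1, 1).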

In this work we also experiment with a different RNN variant, called the Gated Recurrent Unit (GRU) [1], which is considered to be more efficient than the LSTM [2]. It has a formulation similar to the LSTM:

r_t = σ(W_r x_t + U_r h_{t-1})
z_t = σ(W_z x_t + U_z h_{t-1})
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t-1}))
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t

where r_t and z_t are the reset and update gates, defined similarly to the input and forget LSTM gates. No output gate is used, leading to fewer gates and fewer computations, which makes the GRU more efficient than the LSTM.
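Correspondingly, one GRU time step can be sketched as follows (a sketch only; conventions in the literature differ on whether the update gate multiplies the new or the old state in the final interpolation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wr, Ur, Wz, Uz, Wh, Uh):
    """One GRU time step: a reset gate r, an update gate z, no output gate."""
    r = sigmoid(Wr @ x + Ur @ h_prev)              # reset gate
    z = sigmoid(Wz @ x + Uz @ h_prev)              # update gate
    h_cand = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate hidden state
    return (1 - z) * h_prev + z * h_cand           # interpolate old and new

rng = np.random.default_rng(1)
d_in, d_h = 3, 4
weights = [rng.normal(size=(d_h, d_in)) if i % 2 == 0
           else rng.normal(size=(d_h, d_h)) for i in range(6)]
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), *weights)
```

Note that the GRU needs six parameter matrices against the LSTM's eight, which is the source of its efficiency advantage.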

III Empirical Evaluation

III-A Datasets

We used two real-world medical datasets.

IUXRay. The dataset comprises 3,955 anonymized and de-identified radiology reports on 7,470 images [4]. The text of each report follows an XML structure, and the boundaries of each different section are explicitly defined.

MIMIC-III. We used the radiology reports from the Medical Information Mart for Intensive Care (MIMIC-III) [11] database, a rich and commonly used benchmark dataset of 38,597 adult patients admitted between 2001 and 2008 to critical care units at Beth Israel Deaconess Medical Center in Boston, Massachusetts. In this study we employ the free-text reports of the electrocardiogram and imaging studies included in this dataset. The text of the radiology reports in MIMIC-III is loosely separated into sections, which are not explicitly marked up. We sampled 2,928 such reports to yield a dataset equal in number to IUXRay.

The radiology reports of IUXRay and MIMIC-III comprise fewer than 200 tokens per report on average. By contrast, discharge summaries are lengthier and more than quadruple in size (the average number of words per summary without sampling is 1,320). The difference grows even larger when sampling disregards the maximum number of characters per text, because only discharge summaries exceeded this threshold.

              IUXRay                    MIMIC-III
          Acc         KD           Acc         KD
2-GLM     21.83±0.29  16.04±0.26   17.03±0.22  11.46±0.12
3-GLM     34.78±0.38  27.96±0.27   27.34±0.29  19.35±0.27
4-GLM     38.18±0.44  31.60±0.30   25.70±0.29  18.95±0.34
5-GLM     37.89±0.60  32.30±0.47   21.02±0.41  15.63±0.23
6-GLM     35.71±0.78  30.86±0.57   15.98±0.42  11.93±0.31
7-GLM     33.10±0.72  28.82±0.56   12.15±0.40  9.05±0.26
8-GLM     30.23±0.63  26.47±0.62   9.52±0.40   7.04±0.31
9-GLM     27.74±0.63  24.33±0.66   7.29±0.43   5.46±0.37
LSTMLM    51.30±0.61  41.12±0.64   33.97±0.25  25.17±0.29
GRULM     51.30±0.74  41.00±0.40   33.84±0.34  25.42±0.30
TABLE II: Assessment of next word prediction in the radiology reports of IUXRay and MIMIC-III, using statistical (N-GLMs) and neural (LSTMLM, GRULM) language models. Micro-averaged accuracy (Acc) and keystroke discount (KD) are shown for each dataset.

III-B Results

We benchmarked eight statistical language models and two neural language models on the task of predicting the next word in the radiology reports of IUXRay and MIMIC-III. We randomly sampled reports from MIMIC-III until we obtained a subset with the same number of reports as IUXRay. We additionally removed numbers and punctuation, and turned the text to lower case before whitespace tokenization. We held out the last 10K words of each dataset as our test set and used the preceding text to train our models. Any words occurring fewer than 10 times were masked with an '[oov]' token.
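The preprocessing steps above can be sketched as follows (a sketch only; the 10K hold-out and the rare-word threshold follow the text, while the function name and signature are ours):

```python
import re
from collections import Counter

def preprocess(text, min_count=10, test_words=10_000):
    """Lower-case, strip numbers and punctuation, whitespace-tokenize,
    mask rare words with '[oov]', and hold out the last words for testing."""
    tokens = re.sub(r"[^a-z\s]", " ", text.lower()).split()
    counts = Counter(tokens)
    tokens = [t if counts[t] >= min_count else "[oov]" for t in tokens]
    return tokens[:-test_words], tokens[-test_words:]

# Toy demonstration: every word occurs 20 times, so nothing is masked.
train, test = preprocess("The lungs are clear. " * 20, test_words=4)
```

Note that the rare-word counts are computed over the whole corpus before the split, mirroring the single-corpus masking described in the text.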

The statistical language models were N-Gram-based models, with N varying from 2 (only the previous word is considered) to 9. The neural language models were based either on LSTM or GRU. Following the work of [19], we used 50 dimensions for all the hidden representations. Furthermore, we used: a vocabulary of the 1,000 most frequent words; a context window of 5 preceding words; uniformly initialized word embeddings of 200 dimensions; a single-layer feed-forward neural network of 100 dimensions with a non-linear activation before the softmax; Adam optimization with categorical cross-entropy loss; a batch size of 128; a 10% validation split; and early stopping (up to 100 epochs, with a patience of 3 epochs), monitoring the validation loss.

First, we assessed all models based on their ability to reduce keystrokes (Keystroke Discount, KD). Since no log files were available for calculating this number directly, we estimated the score based on the length of the words which were correctly predicted by each system. That is, we assume that instead of striking the keyboard as many times as the characters of a word, the physician, during computer-assisted data entry, simply accepts the correctly predicted word (e.g., by pressing tab). More formally, for a sequence of words (w_1, ..., w_T) and the respective sequence of system-predicted words (ŵ_1, ..., ŵ_T), this measure is defined as:

KD = Σ_t (l(w_t) − κ_t) / Σ_t (l(w_t) − 1)

where l(w_t) is the number of characters of word w_t and κ_t is 1 when the word was correctly predicted by the system (a single accepting keystroke) and equal to the length of the correct word otherwise. When KD equals 1, all words were predicted correctly, while when it equals 0, no word was predicted correctly. We also used micro-averaged accuracy (here, the same as precision or recall), defined as the fraction of correctly predicted words out of all the words in the test set. For both measures, occurrences of the de-identification token ('xxxx') were disregarded during evaluation, because the ability of the systems to locate candidate de-identification terms is out of the scope of this work (in principle, however, medical language models could be used to assist humans in de-identifying medical texts). All '[oov]' occurrences were considered system mistakes.
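Under the boundary conditions stated above (KD is 1 when every word is accepted with a single keystroke, 0 when every character must be typed), the measure can be computed as follows (a sketch; this implementation is ours):

```python
def keystroke_discount(words, predicted):
    """Fraction of avoidable keystrokes actually avoided: accepting a
    correct prediction costs one keystroke, a miss costs len(word)."""
    cost = sum(1 if w == p else len(w) for w, p in zip(words, predicted))
    full = sum(len(w) for w in words)   # keystrokes with no assistance
    best = len(words)                   # keystrokes if every word is accepted
    return (full - cost) / (full - best)
```

For example, with three words of lengths 5, 3, and 5 and only the last one mispredicted, the cost is 1 + 1 + 5 = 7 keystrokes and KD = (13 − 7)/(13 − 3) = 0.6.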

Table II shows the keystroke discount (KD) and the micro-averaged accuracy (Acc) scores for the task of next word prediction, for all systems and both datasets. Neural language models outperform statistical language models on both datasets by a large margin. In MIMIC-III, the keystroke discount increases by 6 absolute percentage units, from 19.35% to 25.42% (a 31% relative increase). A similar increase was found for accuracy, from 27.34% to 33.97% (a 24% relative increase). In IUXRay, the increase was larger: 9 absolute percentage units in KD (from 32.30% to 41.12%) and 13 absolute percentage units in Acc (from 38.18% to 51.30%). The top eight rows of Table II show the different N-Gram-based statistical language models. 3-GLM was the best under both evaluation measures in MIMIC-III. In IUXRay, 4-GLM was best in terms of Acc and 5-GLM in terms of KD. We obtained better performance on IUXRay due to its smaller vocabulary size compared to MIMIC-III.

In a real-world setting, however, physicians may prefer to use the advantages of computer-assisted authoring only for specific terms, for example frequent words, frequent medical terms, or frequent non-medical terms. Thus, in a final experiment, we assumed a deployment setting where the predicted word was only shown if it was one of the frequent ones, and we varied the number of frequent words considered. Fig. 1 shows the absolute number of keystrokes omitted when the best performing LSTMLM was applied. Interestingly, even when the target vocabulary is reduced to only 50 words, we observe a saving of more than 15K keystrokes. For the case of the 50 most frequent words, without the use of LSTMLM the keystrokes would have been approximately 50K.

Fig. 1: Absolute number of keystroke reduction, by applying LSTMLM only when a frequent vocabulary word is predicted. On the x-axis we see the sizes of the frequent-word sets that are employed.
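This deployment setting, where a suggestion appears only for words in a fixed frequent-word set, can be simulated as follows (a sketch with hypothetical toy data; we assume one accepting keystroke per shown-and-correct prediction, as in the KD measure):

```python
from collections import Counter

def keystrokes_saved(words, predicted, k):
    """Keystrokes omitted when predictions are shown only for the k most
    frequent words; each accepted word saves len(word) - 1 keystrokes."""
    frequent = {w for w, _ in Counter(words).most_common(k)}
    return sum(len(w) - 1 for w, p in zip(words, predicted)
               if w == p and w in frequent)

words = ["the", "lungs", "are", "the", "clear"]
preds = ["the", "x", "are", "the", "clear"]
```

With k = 1, only "the" qualifies, and its two correct predictions save 2 + 2 = 4 keystrokes; enlarging the frequent set monotonically increases the savings.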

IV Conclusion

We highlighted the importance of predictive keyboard for medical text and demonstrated the benefits for the physicians in terms of speedups in completing their clinical text reports. Our experimental evaluation on radiology reports from two real-world medical datasets showed that neural language models can achieve an accuracy of up to 51.3%, which implies that the obtained speedups correspond to a similar factor at the word level. Directions for future work include the investigation of alternative statistical and deep learning models, the consideration of additional medical datasets (e.g., discharge summaries), and measuring the speedups in a real-world application with models deployed in healthcare systems.


We thank the anonymous reviewers for their comments. This work was supported in part by the Swedish Research Council starting grant "Temporal Data Mining for Detecting Adverse Events in Healthcare", ref. no. VR-2016-03372, as well as the EXTREME project of the Digital Futures framework.


  • [1] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §II-B.
  • [2] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555. Cited by: §II-B.
  • [3] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 933–941. Cited by: §I-A.
  • [4] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald (2015) Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23 (2), pp. 304–310. External Links: Document Cited by: §III-A.
  • [5] J. Eng and J. M. Eisner (2004) Radiology report entry with automatic phrase completion driven by language modeling. Radiographics 24 (5), pp. 1493–1501. Cited by: §I-A, §I.
  • [6] N. Garay-Vitoria and J. Abascal (2006-02) Text prediction systems: a survey. Univers. Access Inf. Soc. 4 (3), pp. 188–203. External Links: ISSN 1615-5289, Link, Document Cited by: §I.
  • [7] J. Gelšvartas, R. Simutis, and R. Maskeliūnas (2016-01) User adaptive text predictor for mentally disabled huntington’s patients. Intell. Neuroscience 2016. External Links: ISSN 1687-5265, Link, Document Cited by: §I.
  • [8] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. External Links: Document Cited by: §I-A, §I, §II-B.
  • [9] K. Huang, J. Altosaar, and R. Ranganath (2019) Clinicalbert: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342. Cited by: §I-A.
  • [10] F. Jelinek (1997) Statistical methods for speech recognition. MIT press. Cited by: §II-A.
  • [11] A. Johnson, T. J. Pollard, L. Shen, L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3, pp. 160035. Cited by: §III-A.
  • [12] D. Jurafsky (2000) Speech & language processing. Pearson Education India. Cited by: §II-A, §II-A.
  • [13] H. H. Koester and S. P. Levine (1994) Learning and performance of able-bodied individuals using scanning systems with and without word prediction. Assistive Technology 6 (1), pp. 42–53. Cited by: §I-A.
  • [14] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §I-A.
  • [15] Y. Luo (2017) Recurrent neural networks for classifying relations in clinical notes. Journal of Biomedical Informatics 72, pp. 85–95. Cited by: §I-A.
  • [16] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein (2018) Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695. Cited by: §I-A.
  • [17] J. Patrick, M. Sabbagh, S. Jain, and H. Zheng (2010) Spelling correction in clinical notes with emphasis on first suggestion accuracy. In 2nd Workshop on Building and Evaluating Resources for Biomedical Text Mining, pp. 1–8. Cited by: §I-A.
  • [18] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §I-A.
  • [19] G. P. Spithourakis, S. E. Petersen, and S. Riedel (2016) Clinical text prediction with numerically grounded conditional language models. arXiv preprint arXiv:1610.06370. Cited by: §I-A, §III-B.
  • [20] M. Sundermeyer, H. Ney, and R. Schlüter (2015) From feedforward to recurrent lstm neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (3), pp. 517–529. Cited by: §II-B.
  • [21] A. Yazdani, M. Ghazisaeedi, N. Ahmadinejad, M. Giti, H. Amjadi, and A. Nahvijou (2019) Automated misspelling detection and correction in persian clinical text. Journal of Digital Imaging, pp. 1–8. Cited by: §I-A.
  • [22] A. Yazdani, R. Safdari, A. Golkar, and S. R. N. Kalhori (2019) Words prediction based on n-gram model for free-text entry in electronic health records. Health information science and systems 7 (1), pp. 6. Cited by: §I-A.