
Automatic Documentation of ICD Codes with Far-Field Speech Recognition

by   Albert Haque, et al.

Documentation errors increase healthcare costs and cause unnecessary patient deaths. As the standard language for diagnoses and billing, ICD codes serve as the foundation for medical documentation worldwide. Despite the prevalence of electronic medical records, hospitals still witness high levels of ICD miscoding. In this paper, we propose to automatically document ICD codes with far-field speech recognition. Far-field speech occurs when the microphone is located several meters from the source, as is common with smart homes and security systems. Our method combines acoustic signal processing with recurrent neural networks to recognize and document ICD codes in real time. To evaluate our model, we collected a far-field speech dataset of ICD-10 codes used in emergency departments and found our model to achieve 87% accuracy with a BLEU score of 85%. Our method is able to outperform existing methods. This work shows the potential of automatic speech recognition to provide efficient, accurate, and cost-effective documentation.



1 Introduction

More than 250,000 people die every year in the United States due to medical errors, making it the third leading cause of death james2013new ; cdc2016deaths . Unsurprisingly, many of these errors are preventable, such as inaccurate drug doses, unlisted allergies, and wrong-site amputations. Directly responsible for many of these errors is poor documentation hartel2011high ; mulloy2008wrong ; henderson2006quality . Worldwide, ICD codes serve as the standard language for diagnoses, treatment, and billing icd2018usa ; icd2017uk . However, ICD miscoding rates are as high as 20%, with similar rates dating back to the 1990s henderson2006quality ; macintyre1997accuracy . Despite electronic medical records, documentation errors still occur and cost the United States up to $25 billion each year lang2007consultant ; farkas2008automatic .

One of the main sources of ICD miscoding is patient admission o2005measuring . This is particularly important for emergency departments, which handle 141 million visits each year in the United States cdc2014edstats . However, emergency departments are overcrowded, leading to overworked and rushed clinical teams, which can ultimately produce more errors us2011gao ; sun2013effect ; hooper2010compassion . While there has been work on inferring ICD codes from text, those methods still require a written document larkey1995automatic ; subotin2016method ; tutubalina2017encoder ; huang2018empirical . If we can automate such documentation, not only can we potentially reduce medical errors, but we can also free up time for medical teams, who spend up to 26% of their time on documentation tasks alone ammenwerth2009time ; arndt2017tethered .

Figure 1: Overview of our method. The dashed arrow denotes random sampling. Green denotes the encoder, yellow is the decoder, blue is the language model, and gray is the model output.

Outside of healthcare, smart homes have enjoyed the benefits of voice-based home assistants such as Google Home and Amazon Alexa li2017acoustic . These devices can perform interactive search queries to accelerate ordinary household workflows such as cooking or morning routines. In this paper, we bring the advances of smart homes and far-field speech recognition to automatically document ICD codes in healthcare. The input to our model is a spectrogram and the output is a text transcription chiu2018speech . By sampling from an unsupervised language model during training, we can achieve a knowledge distillation effect to improve overall transcription performance.

2 Proposed Method

Overall, our method consists of a sequence-to-sequence deep neural network as an acoustic model (see Figure 1). During training, the acoustic model samples from an unsupervised medical language model to improve transcription.

Acoustic Model. Medical words are often longer than those used in conversational speech. They can contain many syllables, such as the word hyperventilation. As a consequence, the resulting ICD code utterance is long, sometimes five to ten seconds in length. Modeling the full spectrogram would require unrolling the encoder RNN for an infeasibly large number of timesteps, on the order of hundreds to thousands sainath2015learning . Even with truncated backpropagation, this would be a challenging task haykin2001kalman .

Inspired by sainath2015learning , we propose to reduce the temporal length of the input spectrogram by using a learned convolutional filter bank. As shown in Figure 1, silence appears throughout the spectrogram, denoted by dark blue regions. By treating the spectrogram as an image, convolutional filters can not only reduce the temporal dimension but also compress redundant frequency-level information, such as the absence of high frequencies. However, we found convolutional networks insufficient to reduce the temporal dimension to a manageable length. WaveNet, a method for speech synthesis, employs dilated convolutions to control the temporal receptive field at each layer of the network van2016wavenet ; dutilleux1990implementation ; yu2015multi . As a result, a single temporal representation from a high-level layer can encode hundreds, if not thousands, of RNN timesteps. Formally, let $h_i^j$ denote the hidden state of a long short-term memory (LSTM) hochreiter1997long cell at the $i$-th timestep of the $j$-th layer (Equation 1). For a pyramidal LSTM (pLSTM), outputs from the preceding layer, containing high-resolution temporal information, are concatenated,

$$h_i^j = \mathrm{pLSTM}\left(h_{i-1}^j, \left[h_{2i}^{j-1}, h_{2i+1}^{j-1}\right]\right) \qquad (1)$$

where $[\cdot, \cdot]$ denotes the concatenation operator. In Equation 1, the output of a pLSTM unit is now a function of not only its previous hidden state, but also the outputs from previous timesteps from the layer below. The pLSTM provides us with an exponential reduction in the number of RNN timesteps. Not only does the pyramidal RNN provide higher-level temporal features, but it also reduces the inference complexity chan2016listen . Because we do not use a bidirectional RNN, our model can run in real-time. After the pLSTM (encoder) processes the input, the states are given to the decoder.
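The temporal reduction obtained by stacking pyramidal layers can be illustrated with a minimal sketch. It shows only the frame concatenation step (the recurrent update is omitted), and the names `pyramid_concat` and `reduced_length` are hypothetical helpers introduced for illustration, not part of the paper.

```python
from typing import List

Frame = List[float]


def pyramid_concat(frames: List[Frame]) -> List[Frame]:
    """Concatenate each pair of adjacent frames, halving the sequence length.

    Assumption of this sketch: a trailing unpaired frame is dropped.
    """
    return [frames[i] + frames[i + 1] for i in range(0, len(frames) - 1, 2)]


def reduced_length(num_frames: int, num_layers: int) -> int:
    """Number of timesteps left after stacking `num_layers` pyramidal layers."""
    for _ in range(num_layers):
        num_frames //= 2
    return num_frames


# Eight 2-dimensional frames become four 4-dimensional frames.
frames = [[float(t), float(t) + 0.5] for t in range(8)]
halved = pyramid_concat(frames)
```

Each layer halves the number of timesteps, so three stacked layers would turn a 1,000-frame input into 125 steps for the layer above.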

The concept of attention has proven useful in many tasks chorowski2015attention ; mascharka2018transparency . With attention, each step of the model’s decoder has access to all of the encoder’s outputs. The goal of such an “information shortcut” is to address the challenge of learning long-range temporal dependencies bengio1994learning . The distribution for the predicted word is a function of the decoder state $s_i$ and attention context $c_i$. The context vector $c_i$ is produced by an attention mechanism chan2016listen . Specifically, $c_i = \sum_u \alpha_{i,u} h_u$, where attention $\alpha_{i,u}$ is defined as the alignment between the current decoder timestep $i$ and encoder timestep $u$:

$$\alpha_{i,u} = \frac{\exp(e_{i,u})}{\sum_{u'} \exp(e_{i,u'})}$$

and where the score between the output of the encoder or the hidden states, $h_u$, and the previous state of the decoder cell, $s_{i-1}$, is computed with $e_{i,u} = \langle \phi(s_{i-1}), \psi(h_u) \rangle$, where $\phi$ and $\psi$ are sub-networks, e.g. multi-layer perceptrons. The weights of $\phi$ and $\psi$ are learnable parameters.
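As a rough illustration of the attention step, the sketch below scores every encoder output against a decoder state, normalizes the scores with a softmax, and forms the context vector as the weighted sum. It assumes the two sub-networks are identity maps, so the score is a plain dot product; in a trained model they would be learned. The function names and toy vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(decoder_state: Vector,
           encoder_outputs: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (attention weights, context vector).

    Sketch assumption: the scoring sub-networks are identity maps,
    so each score is the dot product of the two vectors.
    """
    scores = [dot(decoder_state, h) for h in encoder_outputs]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_outputs[0])
    # Context vector: attention-weighted sum of the encoder outputs.
    context = [sum(w * h[k] for w, h in zip(weights, encoder_outputs))
               for k in range(dim)]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], encoder_outputs)
```

In this toy input the first and third encoder outputs align equally well with the decoder state, so they receive equal weight and dominate the context vector.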

The final output from the decoder is a sequence of word embeddings (e.g., one-hot vectors), which encode the transcribed sentence. This can be trained by optimizing the cross-entropy loss objective. Many previous works end at this point by performing greedy-search or beam-search decoding sutskever2014sequence ; graves2006connectionist , but we propose to randomly sample from the language model to improve the transcription.

Language Model. Our language model is an n-gram model. Given a sequence of words $w_1, \ldots, w_m$, the model assigns a probability $p(w_1, \ldots, w_m)$ of the sequence occurring in the training corpus:

$$p(w_1, \ldots, w_m) = \prod_{i=1}^{m} p(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})$$

where $n$ is the order of the n-gram. To overcome the paucity of higher-order n-grams, we approximate the n-gram probability by interpolating the individual n-gram probabilities.
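Interpolating n-gram probabilities can be sketched as a weighted sum of maximum-likelihood estimates of different orders. The code below is a minimal illustration under stated assumptions: linear interpolation with hand-picked weights, a three-sentence toy corpus, and hypothetical function names. The paper does not specify its interpolation scheme or weights.

```python
from collections import Counter
from typing import Dict, List, Sequence


def count_ngrams(sentences: List[List[str]], max_order: int) -> Dict[int, Counter]:
    """Count all n-grams of order 1..max_order in a tokenized corpus."""
    counts: Dict[int, Counter] = {n: Counter() for n in range(1, max_order + 1)}
    for words in sentences:
        for n in range(1, max_order + 1):
            for i in range(len(words) - n + 1):
                counts[n][tuple(words[i:i + n])] += 1
    return counts


def interpolated_prob(word: str, history: Sequence[str],
                      counts: Dict[int, Counter],
                      weights: Sequence[float]) -> float:
    """Linearly interpolate estimates of orders 1..len(weights).

    weights[0] applies to the unigram estimate, weights[1] to the bigram
    estimate, and so on. Orders without enough history or without an
    observed context contribute nothing.
    """
    total_words = sum(counts[1].values())
    prob = 0.0
    for n, weight in enumerate(weights, start=1):
        context = tuple(history[-(n - 1):]) if n > 1 else ()
        if len(context) < n - 1:
            continue  # not enough history for this order
        denom = total_words if n == 1 else counts[n - 1][context]
        if denom > 0:
            prob += weight * counts[n][context + (word,)] / denom
    return prob


corpus = [["generalized", "abdominal", "pain"],
          ["lower", "abdominal", "pain"],
          ["chest", "pain"]]
counts = count_ngrams(corpus, max_order=2)
p_pain = interpolated_prob("pain", ["abdominal"], counts, weights=[0.3, 0.7])
```

Here the bigram estimate for "pain" after "abdominal" is 2/2 and the unigram estimate is 3/8, so the interpolated value is 0.3 × 3/8 + 0.7 × 1 = 0.8125.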

Combining the Acoustic and Language Models. To improve our ICD transcriptions, we impose a training constraint on the acoustic model, subject to the language model. Such a concept is not new. When a language model is used during inference but not training chorowski2016towards ; wu2016google , we call this shallow fusion. This was extended by sriram2017cold in the paper Cold Fusion to include the language model during training kannan2017analysis . However, in their work, the language model was fixed during training and kept as a feature extractor. In our work, we also keep the language model constant, but combine cold fusion with scheduled sampling bengio2015scheduled . The effect is a type of unsupervised knowledge distillation hinton2015distilling . We can combine the language model and our acoustic model to select the optimal transcription:


$$y^* = \arg\max_{y} \; \lambda_{AM} \log p_{AM}(y \mid x) + \lambda_{LM} \log p_{LM}(y)$$

where $y^*$ is the selected transcription, and $AM$ and $LM$ denote the acoustic and language model, respectively. The $\lambda$’s control the language model sampling probability and also serve as mixing hyperparameters. The $p$’s denote the posterior probability and $x$ denotes the input spectrogram. The entire process is differentiable and can be optimized with first-order methods kingma2014adam .
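The interplay between the two models can be sketched in two small pieces: choosing a transcription from a weighted combination of acoustic and language model scores, and a scheduled-sampling style choice of the next decoder input. This is a minimal sketch, not the authors' implementation: the log-linear weighting, the function names, and the candidate probabilities are all illustrative assumptions.

```python
import math
import random
from typing import Dict, Tuple


def select_transcription(candidates: Dict[str, Tuple[float, float]],
                         am_weight: float, lm_weight: float) -> str:
    """Pick the candidate with the highest weighted sum of log-probabilities.

    `candidates` maps a transcription to a pair
    (acoustic model probability, language model probability).
    """
    def score(item: Tuple[str, Tuple[float, float]]) -> float:
        _, (p_am, p_lm) = item
        return am_weight * math.log(p_am) + lm_weight * math.log(p_lm)

    return max(candidates.items(), key=score)[0]


def next_decoder_input(ground_truth: str, lm_sample: str,
                       sample_prob: float, rng: random.Random) -> str:
    """With probability `sample_prob`, feed the language model sample
    to the decoder instead of the ground-truth word."""
    return lm_sample if rng.random() < sample_prob else ground_truth


# Made-up probabilities for two acoustically similar candidates.
candidates = {
    "injury with loss of consciousness": (0.52, 0.10),
    "injury without loss of consciousness": (0.48, 0.40),
}
acoustic_only = select_transcription(candidates, am_weight=1.0, lm_weight=0.0)
combined = select_transcription(candidates, am_weight=1.0, lm_weight=0.5)
```

With these illustrative numbers the acoustic model alone slightly prefers "with", while adding the language model term switches the choice to "without".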

3 Experiments

Dataset. We collected a far-field speech dataset of common ICD-10 codes used in emergency departments aapc2014icd . First, a pre-determined list of one hundred ICD-10 codes and descriptions was selected. This resulted in 141 unique words with an average character length of 7.3, median length of 7, maximum length of 16, and a standard deviation of 3.2 characters. Second, the vocabulary list was repeated five times by multiple speakers. This was done to collect diverse pitch tracks and intonation from each speaker. Third, full ICD code descriptions were generated by procedurally concatenating each individual word. For a single ICD code, there were 20,730 acoustic variations per speaker.

A total of six speakers participated in data collection. Each speaker stood 12 feet (3.6 meters) from a computer monitor and microphone. For all experiments, one speaker was excluded from the training set and used as the test set. This was done six times, such that each speaker was part of the test set once, and the average WER and BLEU are reported. Although our procedural dataset generation can allow for millions of training examples, we limited the variations per ICD code to 1,000. As a result, the training set consisted of 60,480 sentences per speaker, for a total training size of 302,400 sentences. Words were converted to a one-hot word embedding and used as decoder targets.
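The leave-one-speaker-out protocol described above can be sketched as follows. The speaker identifiers and utterance strings are placeholders, and `leave_one_speaker_out` is a hypothetical helper. Only the per-speaker sentence count of 60,480 comes from the text.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_speaker_out(
    data: Dict[str, List[str]]
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held-out speaker, training sentences, test sentences) per fold."""
    for held_out in sorted(data):
        train = [sentence
                 for speaker, sentences in sorted(data.items())
                 if speaker != held_out
                 for sentence in sentences]
        yield held_out, train, list(data[held_out])


# Six placeholder speakers with three placeholder utterances each.
data = {f"speaker_{i}": [f"utt_{i}_{j}" for j in range(3)] for i in range(6)}
folds = list(leave_one_speaker_out(data))

# Training-set size per fold: five training speakers at 60,480 sentences each.
sentences_per_speaker = 60_480
training_size = 5 * sentences_per_speaker
```

Each of the six folds trains on five speakers and tests on the sixth, which matches the reported training size of 5 × 60,480 = 302,400 sentences.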

Language Model. The language model was trained on the entire ICD medical corpus icd2018usa . The corpus consisted of 94,127 sentences, 7,153 unique words, 922,201 total words. Punctuations such as commas, dashes, and semi-colons were removed. The model was trained on n-grams up to length 10.

Metrics. We use word error rate (WER) and BLEU as metrics. The WER is defined as $\mathrm{WER} = (S + D + I)/N$, where $S$ denotes the number of word substitutions, $D$ denotes deletions, $I$ denotes insertions, and $N$ denotes the number of words in the ground truth sentence. If $I = 0$, then accuracy is equivalent to recall (i.e., sensitivity). Word-level accuracy is denoted by $1 - \mathrm{WER}$. We also use the Bilingual Evaluation Understudy (BLEU) metric, which is common in machine translation but less used in speech recognition, since it can better capture contextual and syntactic roles of a word he2011word .
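For concreteness, WER can be computed as the word-level Levenshtein distance between the hypothesis and the ground truth, divided by the ground truth length. The sketch below is a standard dynamic-programming implementation. The function name `word_error_rate` is an illustrative choice rather than the authors' evaluation code.

```python
from typing import List


def word_error_rate(reference: List[str], hypothesis: List[str]) -> float:
    """(substitutions + deletions + insertions) / len(reference),
    computed with the Levenshtein distance over words."""
    if not reference:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # match or substitution
        prev = curr
    return prev[-1] / len(reference)


truth = "intracranial injury without loss of consciousness".split()
one_substitution = "intracranial injury with loss of consciousness".split()
wer = word_error_rate(truth, one_substitution)
```

A single substituted word in a six-word ground truth gives a WER of 1/6, and an exact match gives 0.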

Method | WER (Native) | WER (Non-Native) | BLEU (Native) | BLEU (Non-Native)
Human (Medically Trained)
Human (Untrained)
Connectionist Temporal Classification graves2006connectionist
Sequence-to-Sequence sutskever2014sequence
Listen, Attend, and Spell chan2016listen ; chiu2018speech
Cold Fusion deepspeech3
Our Method
Table 1: Comparison with existing methods. Lower WER and higher BLEU is better. Human refers to manual transcription. Native refers to native English speakers. All methods were trained and evaluated on our ICD-10 dataset. ± denotes the 95% confidence interval.

Results. Table 1 shows quantitative results for existing methods and our proposed method. For most methods, performance on non-native English speakers is lower than on native speakers. This is to be expected due to the larger variance in non-native speech, especially for complex Latin medical words. Surprisingly, Cold Fusion has a higher WER and lower BLEU than the LAS model, despite using an external language model. One explanation is the establishment of a dependence on a potentially biased language model. Our method could be viewed as equivalent to cold fusion, but with a smaller mixing parameter. However, we did not run exhaustive tuning to find the optimal hyperparameters for this dataset. As for CTC, the poor performance could partially be attributed to CTC’s design for phoneme-level recognition. In our task, we evaluate CTC at the word level, which significantly increases the branching factor (i.e., there are more unique words than phonemes).

Qualitative Comparison. Table 2 shows qualitative results of our model compared to the baselines. While only two ICD codes are shown in Table 2, in general, many mistakes made by Cold Fusion and our method occur on words which have a complementary or opposite pair (e.g., with/without, left/right, lower/upper). Either word from the pair is valid according to the language model. The slightest acoustic variation, such as a breath or pause, may cause the word to be incorrectly substituted.

Method | Transcription (Example 1) | Transcription (Example 2)
CTC | Generalized pain | Pain abdominal
Seq2Seq | Abdominal pain | Pain injury without loss of consciousness
LAS | Lower abdominal pain | Intracranial injury without loss of consciousness
Cold Fusion | Generalized abdominal pain | Intracranial injury with loss of consciousness
Our Method | Generalized abdominal pain | Intracranial injury without loss of consciousness
Ground Truth | Generalized abdominal pain | Intracranial injury without loss of consciousness
Table 2: Transcriptions for two ICD codes. Bold text indicates a substitution, insertion, or deletion error. Each column indicates a single test-set example. The ground truth is shown in the bottom row.

4 Conclusion

In this work, we presented a method to automatically document ICD codes with far-field speech recognition. Our method combines acoustic signal processing techniques with deep learning-based approaches to recognize ICD codes. There is future research to be done on handling multiple speakers and tonal languages such as Chinese. Recent work on style and speaker tokens may prove beneficial wang2018style ; haque2018conditional . Overall, this work shows the potential of modern automatic speech recognition to provide efficient, accurate, and cost-effective healthcare documentation.


  • [1] AAPC. ICD-10 top 50 codes in emergency departments, 2014.
  • [2] E. Ammenwerth and H.-P. Spötl. The time needed for clinical documentation versus direct patient care. Methods of Information in Medicine, 2009.
  • [3] B. G. Arndt, J. W. Beasley, M. D. Watkinson, J. L. Temte, W.-J. Tuan, C. A. Sinsky, and V. J. Gilchrist. Tethered to the EHR: primary care physician workload assessment using EHR event log data and time-motion observations. The Annals of Family Medicine, 2017.
  • [4] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, and Z. Zhu. Exploring neural transducers for end-to-end speech recognition. Automatic Speech Recognition and Understanding Workshop, 2017.
  • [5] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Neural Information Processing Systems, 2015.
  • [6] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. Trans. on Neural Networks, 1994.
  • [7] CDC. National hospital ambulatory medical care survey: 2014 emergency department summary tables, 2014.
  • [8] CDC. Deaths and mortality, 2016.
  • [9] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In International Conference on Acoustics, Speech, and Signal Processing, 2016.
  • [10] J. Chorowski and N. Jaitly. Towards better decoding and language model integration in sequence to sequence models. arXiv, 2016.
  • [11] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio. Attention-based models for speech recognition. In Neural Information Processing Systems, 2015.
  • [12] CMS. ICD-10-CM official guidelines for coding and reporting, 2016.
  • [13] P. Dutilleux. An implementation of the “algorithme à trous” to compute the wavelet transform. In Wavelets. Springer, 1990.
  • [14] R. Farkas and G. Szarvas. Automatic construction of rule-based ICD-9-CM coding systems. In BMC Bioinformatics, 2008.
  • [15] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, 2006.
  • [16] A. Haque, M. Guo, and P. Verma. Conditional end-to-end audio transforms. Interspeech, 2018.
  • [17] M. J. Hartel, L. P. Staub, C. Röder, and S. Eggli. High incidence of medication documentation errors in a Swiss university hospital due to the handwritten prescription process. BMC Health Services Research, 2011.
  • [18] S. S. Haykin et al. Kalman Filtering and Neural Networks. Wiley Online Library, 2001.
  • [19] X. He, L. Deng, and A. Acero. Why word error rate is not a good metric for speech recognizer training for the speech translation task? In International Conference on Acoustics, Speech, and Signal Processing, 2011.
  • [20] T. Henderson, J. Shepheard, and V. Sundararajan. Quality of diagnosis and procedure coding in ICD-10 administrative data. Medical Care, 2006.
  • [21] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv, 2015.
  • [22] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 1997.
  • [23] C. Hooper, J. Craig, D. R. Janvrin, M. A. Wetsel, and E. Reimels. Compassion satisfaction, burnout, and compassion fatigue among emergency nurses compared with nurses in other selected inpatient specialties. Journal of Emergency Nursing, 2010.
  • [24] J. Huang, C. Osorio, and L. W. Sy. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. arXiv, 2018.
  • [25] J. T. James. A new, evidence-based estimate of patient harms associated with hospital care. Journal of Patient Safety, 2013.
  • [26] D. Jaunzeikare, A. Kannan, P. Nguyen, H. Sak, A. Sankar, J. Tansuwan, N. Wan, Y. Wu, and X. Zhang. Speech recognition for medical conversations. Interspeech, 2018.
  • [27] A. Kannan, Y. Wu, P. Nguyen, T. N. Sainath, Z. Chen, and R. Prabhavalkar. An analysis of incorporating an external language model into a sequence-to-sequence model. arXiv, 2017.
  • [28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015.
  • [29] D. Lang. Natural language processing in the health care industry. Cincinnati Children’s Hospital Medical Center, Winter, 2007.
  • [30] L. S. Larkey and W. B. Croft. Automatic assignment of ICD9 codes to discharge summaries. Technical report, University of Massachusetts at Amherst, 1995.
  • [31] B. Li, T. Sainath, A. Narayanan, J. Caroselli, M. Bacchiani, A. Misra, I. Shafran, H. Sak, G. Pundak, K. Chin, et al. Acoustic modeling for Google Home. Interspeech, 2017.
  • [32] C. R. MacIntyre, M. J. Ackland, E. J. Chandraraj, and J. E. Pilla. Accuracy of ICD–9–CM codes in hospital morbidity data, Victoria: implications for public health research. Australian and New Zealand Journal of Public Health, 1997.
  • [33] D. Mascharka, P. Tran, R. Soklaski, and A. Majumdar. Transparency by design: Closing the gap between performance and interpretability in visual reasoning. arXiv, 2018.
  • [34] D. F. Mulloy and R. G. Hughes. Wrong-site surgery: a preventable medical error. Patient Safety and Quality; An Evidence-Based Handbook for Nurses, 2008.
  • [35] NHS. National clinical coding standards ICD-10, 2017.
  • [36] K. J. O’Malley, K. F. Cook, M. D. Price, K. R. Wildes, J. F. Hurdle, and C. M. Ashton. Measuring diagnoses: ICD code accuracy. Health Services Research, 2005.
  • [37] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv, 2016.
  • [38] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals. Learning the speech front-end with raw waveform CLDNNs. In Interspeech, 2015.
  • [39] A. Sriram, H. Jun, S. Satheesh, and A. Coates. Cold fusion: Training seq2seq models together with language models. arXiv, 2017.
  • [40] M. Subotin and A. R. Davis. A method for modeling co-occurrence propensity of clinical codes with application to ICD-10-PCS auto-coding. Journal of the American Medical Informatics Association, 2016.
  • [41] B. C. Sun, R. Y. Hsia, R. E. Weiss, D. Zingmond, L.-J. Liang, W. Han, H. McCreath, and S. M. Asch. Effect of emergency department crowding on outcomes of admitted patients. Annals of Emergency Medicine, 61(6):605–611, 2013.
  • [42] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Neural Information Processing Systems, 2014.
  • [43] E. Tutubalina and Z. Miftahutdinov. An encoder-decoder model for ICD-10 coding of death certificates. arXiv, 2017.
  • [44] US GAO. Hospital emergency departments: Crowding continues to occur, and some patients wait longer than recommended time frames. Government Accountability Office, GAO-09-347, 2009.
  • [45] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. arXiv, 2018.
  • [46] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv, 2016.
  • [47] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv, 2015.