De-identification of Unstructured Clinical Texts from Sequence to Sequence Perspective

In this work, we propose a novel problem formulation for de-identification of unstructured clinical text. We formulate the de-identification problem as a sequence-to-sequence learning problem instead of a token classification problem. Our approach is inspired by the recent state-of-the-art performance of sequence-to-sequence learning models for named entity recognition. Early experimentation with our proposed approach achieved 98.91% recall on the i2b2 dataset. This performance is comparable to current state-of-the-art models for unstructured clinical text de-identification.




1. Introduction

A recent study (Toscano et al., 2018) found that about 95% of eligible hospitals and 62% of all office-based physicians used Electronic Health Record (EHR) systems. EHR records contain patient information, history, medication prescriptions, vital signs, laboratory test results, etc. Moreover, they contain unstructured notes from physicians regarding the patient, which are rich sources of contextual medical information. Statistical analysis of EHRs can lead to less error-prone medical diagnosis, healthcare cost reduction, and better short-term preventive care (Fernández-Alemán et al., 2013).

However, the analysis of EHRs for clinically significant statistical tasks is not straightforward. EHRs often contain sensitive identifying information regarding patients, which is generally considered private. Specifically, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires 18 re-identifying categories of information to be sanitized from an EHR before dissemination (U.S. Department of Health and Human Services). These categories are called protected health information (PHI) and include the patient's name, profession, and unique identifying numbers such as the social security number, driver's license number, or medical insurance number. HIPAA precludes researchers from performing analysis on large-scale EHR repositories unless they are de-identified.

Using human annotators to de-identify EHRs is a costly process. One estimate (Douglass et al., 2005) reported a cost of $50/hour for human annotators who read around 20,000 words per hour. At this rate, it would cost $250,000 to annotate a dataset of 100 million words. Moreover, EHR de-identification by human annotation is also error-prone. Prior work (Neamatullah et al., 2008) reported that recall varied between 63-94% for 13 human annotators who were asked to de-identify approximately 130 patient notes. Therefore, multiple human annotators will likely be required to annotate the same patient notes to ensure the quality of the de-identified text, increasing the cost of de-identification even further.

The shortcomings of the manual EHR de-identification process led to extensive research on automated EHR de-identification models. Before the widespread adoption of deep learning in natural language processing (NLP), the majority of automated EHR de-identification systems adopted a rule-based approach (Douglass et al., 2005). Rule-based approaches suffer from multiple shortcomings. First, they are sensitive to the dataset: rules that work for one dataset may require extensive calibration for a different dataset. Second, rules fail to take context into account. Consequently, rule-based approaches are not scalable and generally perform poorly. Supervised machine learning methods offered better performance and generalizability than rule-based methods. These models treat de-identification as a binary (PHI vs. non-PHI) or categorical (types of PHI) classification problem. Their shortcoming is that they rely on handcrafted features extracted from tokens or sentences. Deep learning removes this dependency by learning useful features from text data without human intervention. Deep learning methods built on non-linear neural network models have shown state-of-the-art performance in natural language understanding tasks such as named entity recognition (Lample et al., 2016), part-of-speech tagging (Gui et al., 2017), and neural machine translation (Bahdanau et al., 2015). All of these tasks have been solved by some form of recurrent neural network (RNN). In 2018, Google proposed BERT (Devlin et al., 2019), a state-of-the-art neural network architecture that set the benchmark for a range of NLP tasks. BERT builds on the "self-attention" mechanism. The current state-of-the-art EHR de-identification model (Ahmed et al., 2020) also employs an attention-based approach to achieve improved results on benchmark datasets.

Motivation. During our literature review, we noted that the earliest EHR de-identification model dates back to 1996: a system called SCRUB (Sweeney, 1996). The subsequent 25 years of academic research produced a large body of work on EHR de-identification. During this period, methods gradually progressed from rule-based to deep learning models. The proliferation of benchmark datasets such as i2b2 (Stubbs et al., 2015), MIMIC-II (Lee et al., 2011), and MIMIC-III (Johnson et al., 2016) also played a pivotal role in this development. The current state-of-the-art (Ahmed et al., 2020) achieves recall rates (de-identification accuracy) of 98.41%, 82.9%, and 100.0%, respectively, on these datasets. We observe that the recall rate for the MIMIC-II dataset is particularly low. Therefore, we believe there is room for improvement in model performance on benchmark datasets. Moreover, prior works modeled clinical text de-identification as a classification problem. We note that the de-identification problem can be modeled differently as well (e.g., sequence-to-sequence modeling, metric learning). However, these approaches have not yet been explored in the literature.

Figure 1. Classification vs. Sequence to Sequence Modeling

Contribution. We model the de-identification problem as a sequence-to-sequence learning problem. Our approach is inspired by the recent advancement in named entity recognition by (Chen and Moschitti, 2018), which combined transfer learning and sequence-to-sequence modeling. To the best of our knowledge, this is the first work that models the de-identification problem as a sequence-to-sequence learning problem. The summary of our contributions is:

  • We propose a novel sequence-to-sequence learning-based formulation of the unstructured clinical text de-identification problem.

  • Our proposed method achieves 98.91% recall on the i2b2 dataset.

2. Problem Definition

The problem we tackle in this work is the unstructured clinical text de-identification problem. Let us consider an unstructured clinical text X which consists of tokens x_1, x_2, ..., x_n. HIPAA identifies 18 types of PHI which need to be removed from an EHR before public release. Each token x_i either belongs to one of the classes of PHI or does not.

Consider a function f. It takes X as input and produces a sequence of tokens Y = y_1, y_2, ..., y_n. Equation 1 defines each y_i:

    y_i = x_i,  if x_i is not a PHI token
    y_i = s,    if x_i is a PHI token        (1)

Here s is a special token that replaces the PHI tokens. It can be the same for all PHI classes or different for each PHI class. The objective of the de-identification problem is to find the optimal f so that Y = f(X) results in maximum precision and recall.
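As an illustration, the mapping defined above can be sketched in Python. The PHI predicate here is a hypothetical lookup set that stands in for the learned model's decision, and REDACTED is one choice for the special token s:

```python
# Minimal sketch of Equation 1: map each token to itself if non-PHI,
# else to the special replacement token.
REDACTED = "REDACTED"  # the special token s

def deidentify(tokens, is_phi):
    """Apply the mapping f token by token."""
    return [REDACTED if is_phi(tok) else tok for tok in tokens]

phi_tokens = {"Matthew", "Edelson"}  # hypothetical PHI vocabulary
sentence = ["Doctor", "Matthew", "did", "not", "prescribe", "insulin"]
print(deidentify(sentence, lambda t: t in phi_tokens))
# ['Doctor', 'REDACTED', 'did', 'not', 'prescribe', 'insulin']
```

In the learned model, of course, the PHI decision comes from context rather than a fixed lookup; the sketch only fixes the input-output contract of f.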

3. Proposed Solution

We propose an encoder-decoder architecture for unstructured clinical text de-identification. This is the standard architecture for multiple natural language processing tasks such as named entity recognition (Chen and Moschitti, 2018) and machine translation (Bahdanau et al., 2015). Both the encoder and the decoder consist of "multi-head self-attention" layers. From the encoder's perspective, the attention layers encode a variable-length input sequence into a fixed-length context vector. Similarly, from the decoder's perspective, the attention layers decode a fixed-length context vector into a variable-length output sequence. During the training process, the model learns the context and features of the PHI tokens. The decoder then maps the non-PHI tokens to their input and the PHI tokens to the special token.
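To make the computation inside these attention layers concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. For brevity the queries, keys, and values all equal the input (Q = K = V = X); a real multi-head layer learns separate projection matrices per head, which are omitted here:

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention over a token sequence.

    X: (seq_len, d) array of token vectors. Each output row is a
    softmax-weighted mixture of all input rows, so every position can
    attend to the full context.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                               # context-mixed vectors

X = np.random.default_rng(0).normal(size=(5, 8))     # 5 tokens, d = 8
print(self_attention(X).shape)                       # (5, 8)
```

The multi-head variant runs several such attentions in parallel over learned projections and concatenates the results; stacking 8 of these layers (as in our implementation) yields the encoder and decoder towers.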

Figure 1(a) shows the problem formulation used in prior works. The de-identification model takes a sequence of tokens as input and outputs a vector containing the label for each token: 0 stands for non-PHI and 1 stands for PHI. On the other hand, Figure 1(b) shows our proposed solution model. The input is the same as in prior works. The output is a sequence of tokens in which the input PHI tokens have been replaced with a special token, REDACTED. This is the fundamental difference between our approach and prior works. In prior approaches, tokens were mapped to only two classes. In our approach, each token is mapped to itself if it is a non-PHI token; if it is a PHI token, it is mapped to a special token. In other words, the output set of prior approaches is {0,1}, while in our approach the output set consists of all possible input tokens plus the special tokens.

An important difference between our approach and prior approaches is the model architecture. Current deep learning models for de-identification learn a task-specific encoding before applying that encoding for token classification (Ahmed et al., 2020; Dernoncourt et al., 2017). Therefore, current architectures are encoder-classifier models. By contrast, our proposed architecture is an encoder-decoder model. The encoder learns to encode unstructured clinical texts. The decoder learns to replace the PHI tokens with the special token. In other words, the decoder translates the input sequence into a de-identified output sequence.

Let us explain this difference with an example. In Figure 1(a), the non-PHI tokens {Doctor, Did, Not, Prescribe, Insulin, For, Mrs} are mapped to 0 and the PHI tokens {Matthew, Edelson} are mapped to 1. Now consider the example in Figure 1(b). We assume that each token in the input sequence is mapped to an id from 1-9 and the token REDACTED is mapped to id 10. In the output sequence, each non-PHI token is mapped to its own token id, while the PHI tokens are mapped to id 10. In other words, the input sequence "Doctor Matthew did not prescribe insulin for Mrs. Edelson" becomes {1,2,3,4,5,6,7,8,9} before being fed into the encoder. The decoder produces {1,10,3,4,5,6,7,8,10}, which is mapped back to "Doctor REDACTED did not prescribe insulin for Mrs. REDACTED".
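The worked example can be reproduced with a small script, using the same hypothetical vocabulary mapping (ids 1-9 for the input tokens, 10 for REDACTED):

```python
# Encode the example sentence to ids, then decode the decoder's id
# sequence back to tokens.
words = ["Doctor", "Matthew", "did", "not", "prescribe",
         "insulin", "for", "Mrs.", "Edelson"]
vocab = {tok: i for i, tok in enumerate(words, start=1)}
vocab["REDACTED"] = 10
id_to_tok = {i: tok for tok, i in vocab.items()}

encoder_input = [vocab[tok] for tok in words]
print(encoder_input)            # [1, 2, 3, 4, 5, 6, 7, 8, 9]

decoder_output = [1, 10, 3, 4, 5, 6, 7, 8, 10]   # ids emitted by the decoder
print(" ".join(id_to_tok[i] for i in decoder_output))
# Doctor REDACTED did not prescribe insulin for Mrs. REDACTED
```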

4. Preliminary Result


We implemented the encoder-decoder model for sequence-to-sequence learning in Python, using TensorFlow 2.0 as the implementation framework. Both the encoder and the decoder consist of an embedding layer and 8 multi-head self-attention layers. We used weighted softmax cross-entropy as the loss function and the Adam optimizer for training, with an initial learning rate of 0.002.
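The weighted softmax cross-entropy loss can be sketched in NumPy as follows. The per-token weight vector is a hypothetical illustration (e.g., up-weighting PHI positions); the text does not specify the exact weighting scheme:

```python
import numpy as np

def weighted_softmax_xent(logits, targets, weights):
    """Weighted softmax cross-entropy over a token sequence.

    logits: (seq_len, vocab_size) raw scores, targets: (seq_len,) true
    token ids, weights: (seq_len,) per-token weights. Returns the
    weighted mean negative log-likelihood.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)       # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]          # per-token NLL
    return float((weights * nll).sum() / weights.sum())

# Two confident, correct predictions give a near-zero loss.
logits = np.array([[9.0, 0.0, 0.0],
                   [0.0, 9.0, 0.0]])
print(weighted_softmax_xent(logits, np.array([0, 1]), np.ones(2)))
```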

We evaluate the performance of our proposed model based on recall. Recall represents how many of the PHI tokens have been replaced by the special tokens in the output sequences. We put higher emphasis on recall than on precision because, in the worst-case scenario, a single PHI token missed by the model can re-identify the clinical text.
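This recall metric can be sketched as follows, assuming binary gold labels per position (a hypothetical interface; the actual evaluation harness is not described in the text):

```python
# PHI recall: the fraction of gold PHI positions whose output token is
# the special token.
def phi_recall(gold_labels, output_tokens, special="REDACTED"):
    """gold_labels: 1 for PHI, 0 for non-PHI, per input position."""
    phi_positions = [i for i, y in enumerate(gold_labels) if y == 1]
    if not phi_positions:
        return 1.0  # nothing to redact
    caught = sum(output_tokens[i] == special for i in phi_positions)
    return caught / len(phi_positions)

gold = [0, 1, 0, 0, 1]                                  # two PHI tokens
out = ["Doctor", "REDACTED", "did", "not", "Edelson"]   # one PHI token missed
print(phi_recall(gold, out))  # 0.5
```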

Experimental Result. We evaluate our proposed model on the i2b2 2014 challenge task dataset and present the results in Table 1. We achieved higher recall and F1-score than the current state-of-the-art (SOTA) model in the literature. However, our precision score is slightly lower than SOTA. We note that the number of parameters in our model is considerably lower than SOTA, which may have played a role here. We aim to increase the number of model parameters to achieve better precision in future iterations. We are currently performing experiments on the MIMIC-II and MIMIC-III datasets; the results will be included in our future work.

5. Future Work

Experiments on MIMIC-II and III. We will perform experiments on the MIMIC-II and III datasets. During our literature review, we noted that current models have a very low recall rate on MIMIC-II. We aim to improve that recall rate. Moreover, we plan to generate additional unstructured text data from the MIMIC-II and III datasets and train our model on it to achieve greater robustness.

Transfer Learning. We note that the named entity recognition problem is similar to unstructured clinical text de-identification. Our literature review revealed that transfer learning improves the accuracy and generalizability of named entity recognition models. To the best of our knowledge, transfer learning has not been leveraged for the clinical text de-identification problem. We intend to explore this direction in future work.

Semi-Supervised Learning. An important aspect of de-identification model training is the unavailability of ground truth data. Manual ground truth annotation is an expensive and error-prone process. Semi-supervised learning can help mitigate this challenge by training models on partially labeled datasets. Our literature review reveals that semi-supervised learning has not yet been leveraged for clinical text de-identification. We will explore this training method for our models in future work.

Domain Adaptation. A common problem among current de-identification models is the domain shift issue: models trained on one dataset generally perform very poorly on other datasets. Tzeng et al. showed that this issue can be mitigated by training the models with adversarial domain adaptation (Tzeng et al., 2017). We intend to explore this training approach in future iterations of this work.

6. Conclusion

In this work, we presented a novel sequence-to-sequence problem formulation for the clinical text de-identification problem. We designed an encoder-decoder architecture to translate unstructured clinical texts containing PHI tokens into sanitized clinical texts without them. Preliminary analysis of our proposed method shows promising evaluation metric scores. We are currently experimenting on other benchmark datasets to assess the effectiveness of our problem formulation and solution model.

Method # parameters Precision Recall F1-Score
Dernoncourt et al. (Dernoncourt et al., 2017) N/A 97.92 97.84 97.88
Khin et al. (Khin et al., 2018) N/A 98.30 97.37 97.83
Tanbir et al. (Ahmed et al., 2020) (SOTA) 110,000,000 99.01 98.41 98.22
Proposed Method 78,000,000 98.12 98.91 98.51
Table 1. Preliminary experimental evaluation of our proposed method on i2b2 2014 dataset. The best results are marked in bold.


  • T. Ahmed, M. M. Al Aziz, and N. Mohammed (2020) De-identification of electronic health record using neural network. Scientific reports 10 (1), pp. 1–11. Cited by: §1, §1, §3, Table 1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proccedings of International Conference On Learning Representations, Cited by: §1, §3.
  • L. Chen and A. Moschitti (2018) Learning to progressively recognize new named entities with sequence to sequence models. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 2181–2191. Cited by: §1, §3.
  • F. Dernoncourt, J. Y. Lee, O. Uzuner, and P. Szolovits (2017) De-identification of patient notes with recurrent neural networks. Journal of the American Medical Informatics Association 24 (3), pp. 596–606. Cited by: §3, Table 1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1.
  • M. Douglass, G. Clifford, A. Reisner, W. Long, G. Moody, and R. Mark (2005) De-identification algorithm for free-text nursing notes. In Computers in Cardiology, 2005, pp. 331–334. Cited by: §1, §1.
  • J. L. Fernández-Alemán, I. C. Señor, P. Á. O. Lozoya, and A. Toval (2013) Security and privacy in electronic health records: a systematic literature review. Journal of biomedical informatics 46, pp. 541–562. Cited by: §1.
  • T. Gui, Q. Zhang, H. Huang, M. Peng, and X. Huang (2017) Part-of-speech tagging for twitter with adversarial neural networks. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2411–2420. Cited by: §1.
  • A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-iii, a freely accessible critical care database. Scientific data 3 (1), pp. 1–9. Cited by: §1.
  • K. Khin, P. Burckhardt, and R. Padman (2018) A deep learning architecture for de-identification of patient notes: implementation and evaluation. arXiv preprint arXiv:1810.01570. Cited by: Table 1.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270. External Links: Link, Document Cited by: §1.
  • J. Lee, D. J. Scott, M. Villarroel, G. D. Clifford, M. Saeed, and R. G. Mark (2011) Open-access mimic-ii database for intensive care research. In 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 8315–8318. Cited by: §1.
  • I. Neamatullah, M. M. Douglass, H. L. Li-wei, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, and G. D. Clifford (2008) Automated de-identification of free-text medical records. BMC medical informatics and decision making 8 (1), pp. 1–17. Cited by: §1.
  • U.S. Department of Health and Human Services. U.S. Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. Note: [Online; accessed 19-July-2021] Cited by: §1.
  • A. Stubbs, C. Kotfila, and Ö. Uzuner (2015) Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2/uthealth shared task track 1. Journal of biomedical informatics 58, pp. S11–S19. Cited by: §1.
  • L. Sweeney (1996) Replacing personally-identifying information in medical records, the scrub system.. In Proceedings of the AMIA annual fall symposium, pp. 333. Cited by: §1.
  • F. Toscano, E. O’Donnell, M. Unruh, D. Golinelli, G. Carullo, G. Messina, and L. Casalino (2018) Electronic health records implementation: can the european union learn from the united states?. European Journal of Public Health 28, pp. 213–401. Cited by: §1.
  • E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell (2017) Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7167–7176. Cited by: §5.