1 Introduction
(A version of this paper was submitted to the 28th Workshop on Information Technologies and Systems.)
Electronic Health Records (EHRs) have become ubiquitous in the United States in recent years, owing much to the Health Information Technology for Economic and Clinical Health (HITECH) Act of 2009. Their ubiquity has given researchers a treasure trove of new data, especially in the realm of unstructured text. However, this new data source comes with usage restrictions in order to preserve the privacy of individual patients, as mandated by the Health Insurance Portability and Accountability Act (HIPAA). HIPAA requires any researcher using this sensitive data to first strip the medical records of any protected health information (PHI), a process known as de-identification.
HIPAA allows two methods for de-identification: the “Expert Determination” method, in which an expert certifies that the information is rendered not individually identifiable, and the “Safe Harbor” method, in which 18 types of identifiers are removed or replaced with random data so that the data is considered not individually identifiable. Our research pertains to the second method (a list of the relevant identifiers can be seen in Table 1).
The process of de-identification has largely been a manual, labor-intensive task due to both the sensitive nature of the data and the limited availability of software to automate it. This has led to a relatively small number of open health data sets available for public use. Recently, two well-known de-identification challenges have been organized by Informatics for Integrating Biology and the Bedside (i2b2) to encourage innovation in the field of de-identification.
In this paper, we build on the recent advances in natural language processing, especially with regards to word embeddings, by incorporating deep contextualized word embeddings developed by Peters et al. 
into a deep learning architecture. More precisely, we present a deep learning architecture that differs from current architectures in the literature by using bi-directional long short-term memory networks (Bi-LSTMs) with variational dropout and deep contextualized word embeddings, while also using components already present in other systems, such as traditional word embeddings, character LSTM embeddings and conditional random fields. We test this architecture on two gold standard data sets: the 2014 i2b2 de-identification Track 1 data set and the nursing notes corpus. The architecture achieves state-of-the-art performance on both data sets while also converging faster, without the use of dictionaries (or gazetteers) or other rule-based methods typically used in other de-identification systems.
The paper is organized as follows: In Section 2, we review the latest literature around techniques for de-identification with an emphasis on related work using deep learning techniques. In Section 3, we detail our deep learning architecture and also describe how we use the deep contextualized word embeddings method to improve our results. Section 4
describes the two data sets we use to evaluate our method and our evaluation metrics. Section 5 presents the performance of our architecture on the data sets. In Section 6, we discuss the results and provide an analysis of the errors. Finally, in Section 7, we summarize our contributions while also discussing possible future research.
| # | Identifier |
|---|---|
| 2 | All geographic subdivisions smaller than a state |
| 7 | Device identifiers and serial numbers |
| 10 | Social Security numbers |
| 11 | Medical record numbers |
| 14 | Health plan beneficiary numbers |
| 15 | Full-face photographic images and any comparable images |
| 18 | Any other unique identifying number, characteristic, or code |
2 Background and Related Work
The task of automatic de-identification has been heavily studied recently, in part due to two main challenges organized by i2b2 in 2006 and in 2014. The task of de-identification can be classified as a named entity recognition (NER) problem, which has been extensively studied in the machine learning literature. Automated de-identification systems can be roughly broken down into four main categories:

Rule-based Systems
Machine Learning Systems
Hybrid Systems
Deep Learning Systems (technically a subset of machine learning)
2.1 Rule-based Systems
Rule-based systems make heavy use of pattern matching such as dictionaries (or gazetteers), regular expressions and other patterns. Systems such as the ones described in [6, 7]
do not require the use of any labeled data; hence, they are considered unsupervised learning systems. Advantages of such systems include their ease of use, the ease of adding new patterns and their easy interpretability. However, these methods lack robustness with respect to the input. For example, different casings of the same word could be misinterpreted as an unknown word. Furthermore, typographical errors are almost always present in documents, and rule-based systems often cannot correctly handle these types of inaccuracies. Critically, these systems cannot handle context, which could render a medical text unreadable. For example, a diagnosis of “Lou Gehrig's disease” could be misidentified by such a system as a PHI of type Name. The system might then replace the tokens “Lou” and “Gehrig” with randomized names, rendering the text meaningless if enough of these tokens were replaced.
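As a minimal illustration of the pattern-matching approach (the dictionary entries and the regular expression below are hypothetical, not from any published system), a rule-based matcher might look like:

```python
import re

# A sketch of a rule-based PHI matcher: a tiny name gazetteer plus a
# regular expression for phone-like numbers. Patterns are illustrative.
NAME_DICTIONARY = {"lou", "gehrig"}  # hypothetical gazetteer entries
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def tag_tokens(text):
    """Label each whitespace-separated token as NAME, PHONE or O (not PHI)."""
    labels = []
    for token in text.split():
        if token.lower().strip(".,") in NAME_DICTIONARY:
            labels.append("NAME")
        elif PHONE_PATTERN.search(token):
            labels.append("PHONE")
        else:
            labels.append("O")
    return labels
```

Note how `tag_tokens("Diagnosis of Lou Gehrig disease")` labels “Lou” and “Gehrig” as NAME, exactly the context-blind false positive described above.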
2.2 Machine Learning Systems
In machine learning systems, given a sequence of input vectors, a machine learning algorithm outputs label predictions. A variety of such algorithms [10] have been used for building de-identification systems.
These machine learning-based systems have the advantage of being able to recognize complex patterns that are not readily evident to the naked eye. However, since classification is a supervised learning task, most common classification algorithms require a large labeled data set to produce robust models. Furthermore, since most of these algorithms maximize the likelihood of a label given a vector of inputs, rare patterns that do not occur in the training data are likely to be misclassified as not being PHI. Finally, these models might not generalize to other text corpora whose sentence structures and abbreviations, such as those commonly found in medical notes, differ significantly from the training data.
2.3 Hybrid Systems
With the advantages and disadvantages of stand-alone rule-based and ML-based systems well-documented, hybrid systems combining both approaches have achieved impressive results. Systems such as the ones presented for the 2014 i2b2 challenge by Yang et al. and Liu et al. used dictionary look-ups, regular expressions and CRFs to achieve accuracies of well over 90% in identifying PHIs.
It is important to note that such hybrid systems rely heavily on feature engineering, the process of manufacturing new features that are not present in the raw text. Most machine learning techniques cannot take text as an input; they require the text to be represented as a vector of numbers. An example of such features can be seen in the system that won the 2014 i2b2 de-identification challenge by Yang et al. Their system uses token features such as part-of-speech (POS) tags and chunks, contextual features such as the word lemmas and POS tags of neighboring words, orthographic features such as capitalization and punctuation marks, and task-specific features such as a list of all full names and acronyms of US states and TF-IDF statistics. Although such hybrid systems achieve impressive results, feature engineering is a time-intensive task that might not generalize to other text corpora.
2.4 Deep Learning Systems
With the disadvantages of the past three approaches in mind, the current state-of-the-art systems employ deep learning techniques to achieve better results than the best hybrid systems, while also not requiring the time-consuming process of feature engineering. Deep learning is a subset of machine learning that uses multiple layers of artificial neural networks (ANNs), and it has been very successful at most natural language processing (NLP) tasks. Recent advances in deep learning and NLP, especially with regards to named entity recognition, have allowed systems such as the one by Dernoncourt et al. to achieve better results on the 2014 i2b2 de-identification challenge data set than the winning hybrid system proposed by Yang et al. The advances in NLP and deep learning that have enabled this performance are detailed below.
2.4.1 Word Embeddings
ANNs cannot take words as inputs and require numeric inputs; therefore, past approaches to using ANNs for NLP employed a bag-of-words (BoW) representation, where a dictionary is built of all known words and each word in a sentence is assigned a unique vector that is inputted into the ANN. A drawback of this technique is that words with similar meanings are represented completely differently. As a solution to this problem, a technique called word embeddings has been used. Word embeddings gained popularity when Mikolov et al. used ANNs to generate a distributed vector representation of a word based on the usage of the word in a text corpus. This way of representing words allows similar words to be represented by vectors of similar values, while also allowing complex operations such as the famous example v(king) − v(man) + v(woman) ≈ v(queen), where v(w) represents the vector for the word w.
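The analogy can be sketched with toy vectors (the 3-dimensional values below are made up purely for illustration; real embeddings have hundreds of dimensions):

```python
import math

# Toy embeddings illustrating v(king) - v(man) + v(woman) ≈ v(queen).
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "man":   [0.7, 0.1, 0.0],
    "woman": [0.6, 0.1, 0.9],
    "queen": [0.7, 0.9, 1.0],
    "apple": [0.1, 0.0, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Compute king - man + woman component-wise.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

# The nearest remaining word by cosine similarity should be "queen".
nearest = max((w for w in vectors if w not in ("king", "man", "woman")),
              key=lambda w: cosine(target, vectors[w]))
```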
While pre-trained word embeddings such as the widely used GloVe  embeddings are revolutionary and powerful, such representations only capture one context representation, namely the one of the training corpus they were derived from. This shortcoming has led to the very recent development of context-dependent representations such as the ones developed by [2, 14], which can capture different features of a word.
The Embeddings from Language Models (ELMo) from the system by Peters et al. are used by the architecture in this paper to achieve state-of-the-art results. The ELMo representations, learned by combining Bi-LSTMs with a language modeling objective, capture context-dependent aspects of word meaning at the higher-level LSTM layers, while the lower-level LSTM layers capture aspects of syntax. Moreover, the outputs of the different layers of the system can be used independently or averaged to output embeddings that significantly improve some existing models for solving NLP problems. These results motivate our inclusion of the ELMo representations in our architecture.
2.4.2 Neural Networks
The use of ANNs for many machine learning tasks has gained popularity in recent years. In particular, a variant of recurrent neural networks (RNNs) called Bi-directional Long Short-Term Memory (Bi-LSTM) networks has been successfully employed, especially in the realm of NER.
3 Deep Learning Architecture
Our architecture incorporates most of the recent advances in NLP and NER while also differing from the architectures described in the previous section through its use of deep contextualized word embeddings, Bi-LSTMs with variational dropout and the Adam optimizer. Our architecture can be broken down into four distinct layers: pre-processing, embeddings, Bi-LSTM and CRF classifier. A graphical illustration of the architecture can be seen in Figure 1, while a summary of the parameters for our architecture can be found in Table 2.
| Component | Configuration | Dimension |
|---|---|---|
| ELMo embeddings | Averaged layers | 1024 |
| Token embeddings | GloVe | 300 |
| Part-of-speech embeddings | Generated from NLTK | 20 |
| Character LSTMs | Two LSTMs | 25 |
3.1 Pre-processing Layer
For a given document d_i, we first break the document down into sentences s_{i,j}, tokens t_{i,j,k} and characters c_{i,j,k,l}, where i represents the document number, j the sentence number, k the token number and l the character number. For example, t_{1,2,3} = “Patient” means that the token “Patient” is the 3rd token of the 2nd sentence of the 1st document.
After parsing the tokens, we use a widely used and readily available Python toolkit called the Natural Language Toolkit (NLTK) to generate a part-of-speech (POS) tag for each token. This gives a POS feature for each token, which we transform into a 20-dimensional one-hot-encoded input vector and feed into the main LSTM layer.
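The one-hot POS encoding can be sketched as follows (the 20-tag inventory below is a hypothetical ordering, not the paper's actual tag list):

```python
# Sketch of mapping POS tags to 20-dimensional one-hot vectors.
# The tag list is an assumed inventory of coarse Penn Treebank tags,
# with an OTHER bucket for anything outside the inventory.
POS_TAGS = ["NN", "NNS", "NNP", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ",
            "JJ", "RB", "IN", "DT", "CC", "CD", "PRP", "TO", "MD", "WDT",
            "OTHER"]
TAG_INDEX = {tag: i for i, tag in enumerate(POS_TAGS)}

def one_hot_pos(tag):
    """Return a 20-dimensional one-hot vector for a POS tag."""
    vec = [0.0] * len(POS_TAGS)
    vec[TAG_INDEX.get(tag, TAG_INDEX["OTHER"])] = 1.0
    return vec
```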
For the data labels, since a single PHI can be made up of multiple tokens, we format the labels using the BIO scheme. The BIO scheme tags the first token of a PHI with a B- prefix, the remaining tokens of the same PHI with an I- prefix, and all tokens not associated with a PHI as O. For example, the sentence “The patient is sixty seven years old” has the corresponding labels “O O O B-AGE I-AGE O O”.
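The BIO conversion can be sketched as:

```python
def to_bio(tokens, spans):
    """Convert labeled token spans to BIO tags.

    `spans` is a list of (start, end, phi_type) token index ranges,
    end exclusive. Tokens outside any span get the O tag.
    """
    labels = ["O"] * len(tokens)
    for start, end, phi_type in spans:
        labels[start] = "B-" + phi_type
        for i in range(start + 1, end):
            labels[i] = "I-" + phi_type
    return labels

tokens = ["The", "patient", "is", "sixty", "seven", "years", "old"]
labels = to_bio(tokens, [(3, 5, "AGE")])
# labels == ["O", "O", "O", "B-AGE", "I-AGE", "O", "O"]
```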
3.2 Embedding Layer
For the embedding layer, we use three main types of embeddings to represent our input text: traditional word embeddings, ELMo embeddings and character-level LSTM embeddings.
The traditional word embeddings use the latest GloVe pre-trained word vectors, which were trained on the Common Crawl corpus of about 840 billion tokens. For every input token, the GloVe system outputs a dense 300-dimensional word vector representation of that token. We also experimented with word embeddings trained on a biomedical corpus to see whether embeddings trained on medical texts would have an impact on our results.
As mentioned in previous sections, we also incorporate the powerful ELMo representations as a feature to our Bi-LSTMs. The specifics of the ELMo representations are detailed in the original paper. In short, we compute an ELMo representation by passing a token input to the ELMo network and averaging the layers of the network to produce a 1024-dimensional ELMo vector.
Character-level information can capture some information about the token itself while also mitigating issues such as unseen words and misspellings. While lemmatizing a token (i.e., turning inflected forms of a word into their base or dictionary form) can address these issues, tokens such as the ones found in medical texts can carry important distinctions in, for example, their grammatical form. As such, Ma et al. have used Convolutional Neural Networks (CNNs), while Lample et al. have used Bi-LSTMs, to produce character-enhanced representations of each unique token. We utilize the latter approach of using Bi-LSTMs to produce a character-enhanced embedding for each unique word in our data set (words appearing in the test set that do not appear in the training set are mapped to an UNKNOWN representation). The forward and backward LSTMs have 25 hidden units each and the maximum character length is 25, which results in a 50-dimensional embedding vector for each token.
After creating the three embeddings for each token, we concatenate the GloVe and ELMo representations to produce a single 1324-dimensional word vector. This word vector is then further concatenated with the 50-dimensional character embedding vector, the 20-dimensional POS one-hot-encoded vector and the casing embedding vector to produce a single 1394-dimensional input vector that we feed into our Bi-LSTM layer.
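The concatenation step can be sketched with placeholder vectors. Note that the casing feature mentioned above is omitted in this sketch, since the stated dimensions (300 + 1024 + 50 + 20 = 1394) are already accounted for by the other four components:

```python
# Placeholder vectors standing in for the real embeddings; only the
# dimensions matter here, and they follow the paper.
glove = [0.0] * 300            # GloVe token embedding
elmo = [0.0] * 1024            # averaged ELMo layers
char_embedding = [0.0] * 50    # forward + backward character LSTM outputs
pos_one_hot = [0.0] * 20       # one-hot POS feature

word_vector = glove + elmo                                  # 1324 dims
input_vector = word_vector + char_embedding + pos_one_hot   # 1394 dims
```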
3.3 Bi-LSTM Layer
The Bi-LSTM layer is composed of two LSTM layers, a variant of the bidirectional RNN. In short, the Bi-LSTM layer contains two independent LSTMs: one network is fed the input in the normal time direction while the other is fed the input in the reverse time direction. The outputs of the two networks can be combined using summation, multiplication, concatenation or averaging. Our architecture uses simple concatenation to combine the outputs of the two networks.
Our architecture for the Bi-LSTM layer is similar to the ones used by [17, 18, 19], with each LSTM containing 100 hidden units. To ensure that the neural networks do not overfit, we use a variant of the popular dropout technique called variational dropout to regularize our neural networks. Variational dropout differs from the traditional naïve dropout technique by applying the same dropout mask to the inputs, outputs and recurrent layers, in contrast to the traditional technique of applying a different dropout mask to each of the input and output layers. Reimers and Gurevych show that variational dropout applied to the output and recurrent units performs significantly better than naïve dropout or no dropout for NER tasks. As such, we apply a dropout probability of 0.5 to both the output and the recurrent units in our architecture.
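The difference between the two dropout variants can be sketched as follows (a simplified mask-sampling illustration, not the actual training code):

```python
import random

# Naive dropout samples a fresh binary mask at every timestep;
# variational dropout samples one mask per sequence and reuses it
# at every timestep.
def naive_masks(timesteps, dim, p, rng):
    return [[0.0 if rng.random() < p else 1.0 for _ in range(dim)]
            for _ in range(timesteps)]

def variational_masks(timesteps, dim, p, rng):
    mask = [0.0 if rng.random() < p else 1.0 for _ in range(dim)]
    return [mask] * timesteps  # identical mask at every timestep
```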
3.4 CRF layer
As a final step, the outputs of the Bi-LSTM layer are fed into a linear-chain CRF classifier, which maximizes the label probabilities of the entire input sentence. This approach is identical to the Bi-LSTM-CRF model by Huang et al. CRFs have been incorporated in numerous state-of-the-art models [17, 19, 4] because of their ability to incorporate tag information at the sentence level.
While the Bi-LSTM layer takes context into account when generating its label predictions, each decision is independent of the other labels in the sentence. The CRF allows us to find the labeling sequence of a sentence with the highest probability, so that both previous and subsequent label information is used in determining the label of a given token. As a sequence model, the CRF posits a probability model for the label sequence of the tokens in a sentence, conditional on the word sequence and the output scores from the Bi-LSTM model for the given sentence. In doing so, the CRF models the conditional distribution of the label sequence instead of a joint distribution with the words and output scores. Thus, it does not assume independent features, while at the same time not making strong distributional assumptions about the relationship between the features and sequence labels.
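A minimal Viterbi decoder sketches how such a linear-chain CRF picks the highest-scoring label sequence from per-token emission scores and a label-transition table (the scores below are illustrative log-potentials, not learned values):

```python
def viterbi(emissions, transitions, labels):
    """emissions: list of {label: score} per token;
    transitions: {(prev, cur): score}. Returns the best label path."""
    # best[label] = (score of best path ending in label, path so far)
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for scores in emissions[1:]:
        new_best = {}
        for cur in labels:
            prev_lab, (prev_score, path) = max(
                ((p, best[p]) for p in labels),
                key=lambda item: item[1][0] + transitions[(item[0], cur)])
            new_best[cur] = (prev_score + transitions[(prev_lab, cur)]
                             + scores[cur], path + [cur])
        best = new_best
    return max(best.values(), key=lambda v: v[0])[1]

# Toy example: an I tag may not follow O, so the transition score
# forbids the sequence the per-token scores alone would prefer.
labels = ["O", "B", "I"]
transitions = {(a, b): 0.0 for a in labels for b in labels}
transitions[("O", "I")] = -10.0
emissions = [
    {"O": 1.0, "B": 0.0, "I": 0.0},
    {"O": 0.0, "B": 0.5, "I": 2.0},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
```

The greedy per-token choice would be O I O, which the transition score blocks; the decoder returns B I O instead, illustrating how sentence-level tag information changes the prediction.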
4 Data and Evaluation Metrics
The i2b2 corpus was used by all tracks of the 2014 i2b2 challenge. It consists of 1,304 patient progress notes for 296 diabetic patients. All the PHIs were removed and replaced with random replacements. The PHIs in this data set were broken down first into the HIPAA categories and then into the i2b2-PHI categories, as shown in Table 3. Overall, the data set contains 56,348 sentences with 984,723 tokens, of which 41,355 are PHI tokens representing 28,867 separate PHI instances. For our train-validation-test split, we chose 10% of the training sentences (3,381 sentences) to serve as our validation set, while a separately held-out official test data set was specified by the competition. This test data set contains 22,541 sentences, including 15,275 PHI tokens.
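The 90/10 validation split described above can be sketched as follows (a generic shuffled split over sentence indices, not the exact procedure used in the paper):

```python
import random

def split_train_valid(sentences, valid_fraction=0.1, seed=0):
    """Shuffle the sentences and hold out a validation fraction."""
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```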
The nursing notes were originally collected by Neamatullah et al. . The data set contains 2,434 notes of which there are 1,724 separate PHI instances. A summary of the breakdown of the PHI categories of this nursing corpora can be seen in Table 3.
| PHI Category | i2b2 Sub-Types | Nursing Sub-Types |
|---|---|---|
| Name | Patient, Doctor | Patient Name, Initial |
| Location | Street, City, State | Location |
| Contact | Phone, Fax, Email, URL, IP Address | Phone |
| ID | Medical Record, ID No, SSN, License No | - |
4.1 Evaluation Metrics
For de-identification tasks, the three metrics we use to evaluate the performance of our architecture are precision, recall and the F1 score, defined as Precision = TP / (TP + FP), Recall = TP / (TP + FN) and F1 = 2 · Precision · Recall / (Precision + Recall), where TP, FP and FN denote true positives, false positives and false negatives respectively. We compute both the binary F1 score and the three metrics for each PHI type for both data sets. Note that the binary F1 score measures whether a token was identified as PHI at all, as opposed to correctly predicting the right PHI type. For de-identification, we place more importance on identifying whether a token is a PHI instance, with correctly predicting the right PHI type as a secondary objective.
Notice that a high recall is paramount, given the risk of accidentally disclosing sensitive patient information if not all PHIs are detected and removed from the document or replaced by fake data. A high precision is also desired to preserve the integrity of the documents, as a large number of false positives might obscure the meaning of the text or even distort it. As the harmonic mean of precision and recall, the F1 score gives an overall measure of model performance that is frequently employed in the NLP literature.
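Token-level binary PHI metrics can be computed as follows (a sketch assuming BIO-style labels, where any non-O tag counts as PHI):

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Token-level binary PHI metrics: any non-O label counts as PHI."""
    tp = fp = fn = 0
    for true, pred in zip(true_labels, predicted_labels):
        true_phi, pred_phi = true != "O", pred != "O"
        if pred_phi and true_phi:
            tp += 1
        elif pred_phi:
            fp += 1
        elif true_phi:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```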
As a benchmark, we use the results of the systems by Burckhardt et al., Liu et al., Dernoncourt et al. and Yang et al. on the i2b2 data set, and the performance of Burckhardt et al. on the nursing corpus. Note that Burckhardt et al. used the entire data set for their results, as theirs is an unsupervised learning system, while we split the nursing data set into 60% training data and 40% testing data.
We evaluated the architecture on both the i2b2-PHI categories and the HIPAA-PHI categories for the i2b2 data set based on token-level labels. Note that the HIPAA categories are a superset of the i2b2-PHI categories. We also ran the analysis five or more times to obtain a range of maximum scores for the different data sets.
Table 4 summarizes how our architecture performed against other systems on the binary F1 metrics, while Table 5 and Table 6 summarize the performance of our architecture against other systems on the HIPAA-PHI categories and i2b2-PHI categories respectively. Table 7 presents a summary of the performance on the nursing note corpus while also contrasting it with the performance achieved by the deidentify system.
Performance on the i2b2 data set:

| Model | Precision | Recall | F1 score |
|---|---|---|---|
| Dernoncourt et al. | 0.9792 | 0.9783 | 0.9787 |
| Liu et al. | 0.9646 | 0.9380 | 0.9511 |
| Yang et al. | 0.9815 | 0.9414 | 0.9611 |
| Burckhardt et al. | 0.5989 | 0.8296 | 0.6957 |
| PHI Type | Our architecture | Yang et al. |
| PHI Category | PHI Sub-Type | Burckhardt et al. | Our architecture |
6 Discussion and Error Analysis
As we can see in Table 5, with the exception of ID, our architecture performs considerably better than the systems by Liu et al. and Yang et al. Dernoncourt et al. did not provide exact figures for the HIPAA-PHI categories, so we have excluded them from this analysis. Furthermore, Table 4 shows that our architecture performs similarly to the best scores achieved by Dernoncourt et al., with our architecture slightly edging out Dernoncourt et al. on the precision metric. For the nursing corpus, our system, while not matching the performance on the i2b2 data set, managed to best the scores achieved by the deidentify system while also achieving a binary F1 score of over 0.812. It is important to note that since deidentify is an unsupervised learning system, it did not require a train-valid-test split and therefore used the whole data set for its performance numbers. The results of our architecture are assessed using a 60%/40% train/test split.
Our architecture also converges considerably faster, reaching its best performance within 4 epochs. A possible explanation for this is our architecture's use of the Adam optimizer, whereas the NeuroNER system uses the Stochastic Gradient Descent (SGD) optimizer. In fact, Reimers and Gurevych show that the SGD optimizer performed considerably worse than the Adam optimizer for different NLP tasks.
Furthermore, we do not see any noticeable improvement from using the PubMed-trained word embeddings instead of the general-text-trained GloVe word embeddings. In fact, we consistently saw better scores using the GloVe embeddings. This could be because our use case involves identifying general labels such as names, phone numbers and locations, rather than biomedical terms such as diseases, which are far better represented in the PubMed corpus.
6.1 Error Analysis
6.1.1 i2b2 Dataset
We will mainly focus on two PHI categories, Profession and ID, for our error analysis on the i2b2 data set. It is interesting to note that the best performing models on the i2b2 data set, such as the one by Dernoncourt et al., experienced similarly lower performance on the same two categories. However, the performances by Dernoncourt et al. were achieved using a “combination of n-gram, morphological, orthographic and gazetteer features”, while our architecture uses only POS tagging as an external feature. Dernoncourt et al. posit that the lower performance on the Profession category might be due to the close embeddings of Profession tokens to other PHI tokens, which we can confirm in our architecture as well. Furthermore, our experiments show that the Profession PHI performs considerably better with the PubMed-embedded model than with the GloVe-embedded model. This could be because the PubMed embeddings were trained on a database of medical literature, whereas GloVe was trained on a general corpus; as a result, the PubMed embeddings for Profession tokens might not lie as close to other tokens as is the case for the GloVe embeddings.
For the ID PHI, our analysis shows that some of the errors were due to tokenization errors. For example, a “:” was counted as a PHI token, which our architecture correctly predicted as not a PHI token. Since our architecture is not custom-tailored to detect sophisticated ID patterns like the systems in [10, 11], we fail to detect some ID PHIs such as “265-01-73”, a medical record number, which our architecture predicted to be a phone number due to the format of the number. Such errors could easily be mitigated by the use of simple regular expressions.
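Such a post-processing rule could be sketched as follows (the patterns are illustrative; a real system would need the site-specific ID formats):

```python
import re

# Hypothetical patterns: a 3-2-2 digit grouping for medical record
# numbers like "265-01-73", and a 3-3-4 grouping for US phone numbers.
MRN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{2}")
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

def classify_number(token):
    """Disambiguate ID-like tokens from phone-like tokens by format."""
    if MRN_PATTERN.fullmatch(token):
        return "ID"
    if PHONE_PATTERN.fullmatch(token):
        return "PHONE"
    return "O"
```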
6.1.2 Nursing Dataset
We can see that our architecture outperforms the deidentify system by a considerable margin on most categories as measured by the F1 score. For example, the authors of deidentify note that Date PHIs have considerably low precision values, while our architecture achieves a precision of greater than 0.915 for the Date PHI. However, Burckhardt et al. achieve an impressive precision of 0.899 and recall of 1.0 for the Phone PHI, while our architecture only manages 0.778 and 0.583 respectively. Our analysis of this category shows that this is mainly due to a difference in tokenization: stand-alone numbers are classified as not being PHI.
We tried to use the model trained on the i2b2 data set to predict the categories of the nursing data set. However, due to differences in text structure, content and format, we achieved less than random performance on the nursing data set. This raises an important point about the transferability of such models.
6.2 Ablation Analysis
Our ablation analysis shows that each layer of our model adds to the overall performance. Figure 2 shows the binary F1 scores on the i2b2 data set, with each bar representing the model with one feature toggled off. For example, the “No Char Embd” bar shows the performance of the model with no character embeddings and everything else the same as our best model.
We see a noticeable drop in performance when we exclude the ELMo embeddings, compared to only a slight decrease when we exclude the GloVe embeddings; this suggests that the GloVe embeddings are a feature we might choose to exclude if computation time is limited. Furthermore, the impact of replacing variational dropout with naïve dropout shows that variational dropout is better at regularizing our neural network.
7 Conclusion
In this study, we show that our deep learning architecture, which incorporates the latest developments in contextualized word embeddings and NLP, achieves state-of-the-art performance on two widely available gold standard de-identification data sets, while matching the performance of the best available system in fewer epochs. Our architecture also significantly improves over the performance of the hybrid deidentify system on the nursing data set.
This architecture could be integrated into a client-ready system such as deidentify. However, as mentioned in Section 6, the use of a dictionary (or gazetteer) might improve the model even further, especially with regards to the Location and Profession PHI types. Such a hybrid system would be highly beneficial to practitioners who need to de-identify patient data on a daily basis.
-  J. Adler-Milstein and A. K. Jha, “Hitech act drove large gains in hospital electronic health record adoption,” Health Affairs, vol. 36, no. 8, pp. 1416–1422, 2017.
-  M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proc. of NAACL, 2018.
-  A. Stubbs, C. Kotfila, and Ö. Uzuner, “Automated systems for the de-identification of longitudinal clinical narratives: Overview of 2014 i2b2/uthealth shared task track 1,” Journal of Biomedical Informatics, vol. 58, no. nil, pp. S11–S19, 2015.
-  I. Neamatullah, M. M. Douglass, L.-w. H. Lehman, A. Reisner, M. Villarroel, W. J. Long, P. Szolovits, G. B. Moody, R. G. Mark, and G. D. Clifford, “Automated de-identification of free-text medical records,” BMC Medical Informatics and Decision Making, vol. 8, no. 1, p. 32, 2008.
-  HHS Office of the Secretary, Office for Civil Rights (OCR), “Methods for de-identification of PHI,” Nov 2015.
-  F. P. Morrison, L. Li, A. M. Lai, and G. Hripcsak, “Repurposing the clinical record: Can an existing natural language processing system de-identify clinical notes?,” Journal of the American Medical Informatics Association, vol. 16, no. 1, pp. 37–39, 2009.
-  O. Uzuner, Y. Luo, and P. Szolovits, “Evaluating the state-of-the-art in automatic de-identification,” Journal of the American Medical Informatics Association, vol. 14, no. 5, pp. 550–563, 2007.
-  O. Ferrández, B. R. South, S. Shen, F. J. Friedlin, M. H. Samore, and S. M. Meystre, “Evaluating current automatic de-identification methods with veteran’s health administration clinical documents,” BMC Medical Research Methodology, vol. 12, no. 1, p. 109, 2012.
-  S. M. Meystre, F. J. Friedlin, B. R. South, S. Shen, and M. H. Samore, “Automatic de-identification of textual documents in the electronic health record: a review of recent research.,” BMC medical research methodology, vol. 10, p. 70, 2010.
-  F. Dernoncourt, J. Y. Lee, O. Uzuner, and P. Szolovits, “De-identification of patient notes with recurrent neural networks,” CoRR, 2016.
-  H. Yang and J. M. Garibaldi, “Automatic detection of protected health information from clinic narratives.,” Journal of biomedical informatics, vol. 58 Suppl, pp. S30–8, 2015.
-  Z. Liu, Y. Chen, B. Tang, X. Wang, Q. Chen, H. Li, J. Wang, Q. Deng, and S. Zhu, “Automatic de-identification of electronic medical records using token-level and character-level conditional random fields,” Journal of Biomedical Informatics, vol. 58, no. nil, pp. S47–S52, 2015.
-  J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation,” in In EMNLP, 2014.
-  B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: Contextualized word vectors,” CoRR, 2017.
-  S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, and S. Ananiadou, “Distributional semantics resources for biomedical text processing,” 2013.
-  X. Ma and E. Hovy, “End-to-end sequence labeling via bi-directional lstm-cnns-crf,” CoRR, 2016.
-  G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, “Neural architectures for named entity recognition,” CoRR, 2016.
-  F. Dernoncourt, J. Y. Lee, and P. Szolovits, “Neuroner: an easy-to-use program for named-entity recognition based on neural networks,” CoRR, 2017.
-  Z. Liu, B. Tang, X. Wang, and Q. Chen, “De-identification of clinical notes via recurrent neural network and conditional random field.,” Journal of biomedical informatics, vol. 75S, pp. S34–S42, 2017.
-  Y. Gal and Z. Ghahramani, “A theoretically grounded application of dropout in recurrent neural networks,” CoRR, 2015.
-  N. Reimers and I. Gurevych, “Optimal hyperparameters for deep lstm-networks for sequence labeling tasks,” CoRR, 2017.
-  Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” CoRR, 2015.
-  P. Burckhardt and R. Padman, “deidentify,” AMIA Annu Symp Proc, vol. 2017, pp. 485–494, 2017.