With the widespread adoption of electronic health records (EHRs), large quantities of unstructured medical patient data have become available. This data offers significant opportunities to advance medical research and to improve healthcare-related services. However, patient privacy must be protected when performing secondary analysis of medical data. This is not only an ethical prerequisite, but also a legal requirement imposed by privacy legislation such as the US Health Insurance Portability and Accountability Act (HIPAA) (HIPAA, 1996) and the European General Data Protection Regulation (GDPR) (GDPR, 2016). To facilitate privacy protection, de-identification has been proposed as a process that removes or masks any kind of protected health information (PHI) of a patient such that it becomes difficult to establish a link between an individual and the data (Meystre, 2015). What type of information constitutes PHI is in part defined by the privacy laws of the corresponding country. For instance, the HIPAA regulation defines 18 categories of PHI, including names, geographic locations, and phone numbers (HIPAA, 2012). According to the HIPAA safe-harbor rule, data is no longer personally identifying, and thus no longer subject to the privacy regulation, if these 18 PHI categories have been removed. As the GDPR does not provide such clear PHI definitions, we employ the HIPAA definitions throughout this paper.
As most EHRs consist of unstructured, free-form text, manual de-identification is a time-consuming and error-prone process which does not scale to the amounts of data needed for many data mining and machine learning scenarios (Neamatullah et al., 2008; Douglass et al., 2004). Therefore, automatic de-identification methods are desirable. Previous research proposed a wide range of methods that make use of natural language processing techniques including rule-based matching and machine learning (Meystre, 2015). However, most evaluations are constrained to medical records written in the English language. The generalizability of de-identification methods across languages and domains is largely unexplored.
To test the generalizability of existing de-identification methods, we annotated a new dataset of 1260 medical records from three sectors of Dutch healthcare: elderly care, mental care and disabled care (Section 3). Figure 1 shows an example record with annotated PHI. We then compare the performance of the following three de-identification methods on this data (Section 4):
A rule-based system, DEDUCE, developed for Dutch clinical text (Menger et al., 2018)
A feature-based Conditional Random Field (CRF) as described in Liu et al. (2015)
A neural BiLSTM-CRF with contextual string embeddings (Akbik et al., 2018)
We test the transferability of each method across three domains of Dutch healthcare. Finally, the generalizability of the methods is compared across languages using two widely used English benchmark corpora (Section 5).
This paper makes three main contributions. First, our experiments show that the only openly available de-identification method for the Dutch language fails to generalize to other Dutch medical domains. This highlights the importance of a thorough evaluation of the generalizability of de-identification methods. Second, we offer a novel comparison of several state-of-the-art de-identification methods across both languages and domains. Our experiments show that a popular neural architecture generalizes best, even when limited amounts of training data are available. The neural method only considers word/character sequences, which we find to be sufficient and more robust across languages and domains than the structural features employed by traditional machine learning approaches. However, our experiments also reveal that the neural method may still suffer substantially lower performance in new domains. A direct consequence for de-identification practitioners is that pre-trained models require additional fine-tuning to be fully applicable to new domains. Third, we share our pre-trained models and code with the research community. The creation of these resources required a significant time investment and access to sensitive medical data. We anticipate that they will be of direct value to text mining researchers.
2. Related Work
Previous work on de-identification can be roughly organized into four groups: (1) creation of benchmark corpora, (2) approaches to de-identification, (3) work on languages other than English, and (4) cross-domain de-identification.
Various English benchmark corpora have been created, including nursing notes, longitudinal patient records and psychiatric intake notes (Neamatullah et al., 2008; Stubbs and Uzuner, 2015; Stubbs et al., 2017). Furthermore, Deléger et al. (2012) created a heterogeneous dataset comprising 22 different document types. In contrast to existing datasets, which only contain records from at most two different medical institutes, the data used in this paper was sampled from a total of 9 institutes active in the Dutch healthcare sector. The contents, structure and writing style of the documents strongly depend on the processes and individuals specific to an institute, which contributes to a heterogeneous corpus.
Most existing de-identification approaches are either rule-based or machine-learning based. Rule-based methods combine various heuristics in the form of patterns, lookup lists and fuzzy string matching to identify PHI (Gupta et al., 2004; Neamatullah et al., 2008). The majority of machine learning approaches employ feature-based CRFs (Aberdeen et al., 2010; He et al., 2015), ensembles combining CRFs with rules (Stubbs et al., 2015a) and, most recently, neural networks (Dernoncourt et al., 2017; Liu et al., 2017). A thorough overview of the different de-identification methods is given in Meystre (2015). In this study, we compare several state-of-the-art de-identification methods. With respect to rule-based approaches, we apply DEDUCE, a recently developed method for Dutch data (Menger et al., 2018); to the best of our knowledge, this is the only openly available de-identification method tailored to Dutch data. For a feature-based machine learning method, we re-implement the token-level CRF by Liu et al. (2015). Previous work on neural de-identification used a BiLSTM-CRF architecture with character-level and ELMo embeddings (Dernoncourt et al., 2017; Khin et al., 2018). Similarly, we use a BiLSTM-CRF but apply recent advances in neural sequence modeling by using contextual string embeddings (Akbik et al., 2018).
To the best of our knowledge, we are the first to offer a comparison of de-identification methods across languages. With respect to de-identification in languages other than English, only three studies consider Dutch data. Scheurwegs et al. (2013) and Menger et al. (2018) studied the de-identification of Dutch clinical text; the latter developed and released a rule-based method validated on 400 psychiatric nursing notes and treatment plans of a single Dutch hospital. Tjong Kim Sang et al. (2019) evaluated an existing named entity tagger for the de-identification of autobiographic emails on publicly available Wikipedia texts. Furthermore, de-identification has been studied in several other languages, including German, French, Korean and Swedish (Richter-Pechanski et al., 2018; Névéol et al., 2018).
With respect to cross-domain de-identification, the 2016 CEGS N-GRID shared task evaluated the portability of pre-trained de-identification methods to a new set of English psychiatric records (Stubbs et al., 2017). Overall, the existing systems did not perform well on the new data. Here, we provide a similar comparison by cross-testing on three domains of Dutch healthcare.
3. Dataset

This section describes the construction of our Dutch benchmark dataset called NUT (Nedap/University of Twente). The data was sampled from 9 healthcare institutes and annotated for PHI according to a tagging scheme derived from Stubbs and Uzuner (2015). Furthermore, following common practice in the preparation of de-identification corpora, we replaced PHI instances with realistic surrogates to comply with privacy regulations. To compare the performance of the de-identification methods across languages, we use the English i2b2/UTHealth corpus and the nursing notes corpus (Stubbs and Uzuner, 2015; Neamatullah et al., 2008). An overview of the three datasets can be found in Table 1.
|Dataset|NUT|i2b2 (Stubbs and Uzuner, 2015)|Nursing (Neamatullah et al., 2008)|
|---|---|---|---|
|Domain(s)|elderly, mental and disabled care|clinical|clinical|
|Institutes|9 (3 per domain)|2|1|
3.1. Data Sampling
We sample data from a snapshot of the databases of 9 healthcare institutes with a total of 83,000 patients. Three domains of healthcare are equally represented in this snapshot: elderly care, mental care and disabled care. We consider two classes of documents to sample from: surveys and progress reports. Surveys are questionnaire-like forms used by the medical staff to take notes during intake interviews, record the outcomes of medical tests, or formalize the treatment plan of a patient. Progress reports are short documents describing the current condition of a patient receiving care, sometimes on a daily basis. The use of surveys and progress reports differs strongly across healthcare institutes and domains. In total, the snapshot consists of 630,000 surveys and 13 million progress reports.
When sampling from the snapshot described above, we aim to maximize both the variety of document types and the variety of PHI, two essential properties of a de-identification benchmark corpus (Deléger et al., 2012). First, to ensure a wide variety of document types, we select surveys in a stratified fashion according to their type label provided by the EHR system (e.g., intake interview, care plan, etc.). Second, to maximize the variety in PHI, we sample medical reports on a patient basis: for each patient, a random selection of 10 medical reports is combined into a patient file. We then select patient files uniformly at random to ensure that no patient appears multiple times within the sample. Furthermore, to control the annotation effort, we impose two subjective limits on the document length: a document has to contain at least 50 tokens, but no more than 1000 tokens, to be included in the sample. For each of the 9 healthcare institutes, we sample 140 documents (70 surveys and 70 patient files), which yields a total sample size of 1260 documents (see Table 1).
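The sampling constraints above (stratified survey selection, 10 reports per patient file, 50-1000 tokens per document) can be sketched as follows; the record format (dicts with "tokens", "type" and "patient_id" fields) and the round-robin stratification are our own illustrative choices, not the authors' exact implementation:

```python
import random

def sample_documents(surveys, patient_reports, per_institute=140, seed=42):
    """Sketch of the per-institute sampling: half surveys (stratified by
    EHR type label), half patient files (10 random reports per patient),
    both restricted to 50-1000 tokens."""
    rng = random.Random(seed)

    def length_ok(tokens):
        return 50 <= len(tokens) <= 1000

    # Stratified survey selection: round-robin over EHR type labels.
    by_type = {}
    for s in surveys:
        if length_ok(s["tokens"]):
            by_type.setdefault(s["type"], []).append(s)
    picked_surveys = []
    types = sorted(by_type)
    while len(picked_surveys) < per_institute // 2 and types:
        for t in list(types):
            if not by_type[t]:
                types.remove(t)
                continue
            picked_surveys.append(by_type[t].pop(rng.randrange(len(by_type[t]))))
            if len(picked_surveys) == per_institute // 2:
                break

    # Patient files: 10 random reports per patient, one file per patient.
    by_patient = {}
    for r in patient_reports:
        by_patient.setdefault(r["patient_id"], []).append(r)
    files = []
    for pid, reports in by_patient.items():
        chosen = rng.sample(reports, min(10, len(reports)))
        tokens = [tok for rep in chosen for tok in rep["tokens"]]
        if length_ok(tokens):
            files.append({"patient_id": pid, "tokens": tokens})
    picked_files = rng.sample(files, min(per_institute // 2, len(files)))
    return picked_surveys, picked_files
```

Because files are built one per patient before uniform sampling, no patient can appear twice in the sample.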
We received approval for the collection and use of our dataset from the ethics review board of our institution. Due to privacy regulations, the dataset constructed in this paper cannot be shared.
3.2. Annotation Scheme
Since the GDPR does not provide any strict rules about which types of PHI should be removed during de-identification, we base our PHI tagging scheme on the guidelines defined by the US HIPAA regulations. In particular, we closely follow the annotation guidelines and the tagging scheme used by Stubbs and Uzuner (2015) which consists of 32 PHI tags among 8 classes: Name, Profession, Location, Age, Date, Contact Information, IDs and Other. The Other category is used for information that can be used to identify a patient, but which does not fall into any of the remaining categories. For example, the sentence “the patient was a guest speaker on the subject of diabetes in the Channel 2 talkshow.” would be tagged as Other. It is worth mentioning that this tagging scheme does not only capture direct identifiers relating to a patient (e.g., name and date of birth), but also indirect identifiers that could be used in combination with other information to reveal the identity of a patient. Indirect identifiers include, for example, the doctor’s name, information about the hospital and a patient’s profession.
|Category|i2b2 (Stubbs and Uzuner, 2015)|NUT|
|---|---|---|
|Name|Patient, Doctor, Username|Name|
|Location|Room, Department|Internal Location|
||Hospital, Organization|Hospital, Organization|
||Street, City, State, ZIP, Country|Address|
|Age|Over 90, Under 90|Age|
|Contact|Phone, FAX, Email|Phone/FAX, Email|
|IDs|SSN, 8 fine-grained ID tags|SSN, ID|
We made two adjustments to the tagging scheme of Stubbs and Uzuner (2015). First, to reduce the annotation effort, we merged some of the 32 fine-grained PHI tags into a more generic set of 16 tags (see Table 2). For example, the fine-grained location tags Street, City, State, ZIP, and Country were merged into a generic Address tag. While this simplifies the annotation process, it complicates the generation of realistic surrogates: given an address string, one has to infer its format to replace the individual parts with surrogates of the same semantic type. We address this issue in Section 3.4. Second, due to the high frequency of care institutes in our dataset, we decided to introduce a separate Care Institute tag that complements the Organization tag. This allows for straightforward surrogate generation, where names of care institutes are replaced with those of other care institutes rather than with more generic company names (e.g., Google).
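The tag merge in Table 2 amounts to a simple many-to-one mapping; a minimal sketch, where the exact tag string constants are illustrative rather than taken verbatim from either corpus:

```python
# Merge mapping derived from Table 2: fine-grained i2b2/UTHealth tags
# (keys) collapse into the generic NUT tags (values).
I2B2_TO_NUT = {
    "PATIENT": "Name", "DOCTOR": "Name", "USERNAME": "Name",
    "ROOM": "Internal_Location", "DEPARTMENT": "Internal_Location",
    "HOSPITAL": "Hospital", "ORGANIZATION": "Organization",
    "STREET": "Address", "CITY": "Address", "STATE": "Address",
    "ZIP": "Address", "COUNTRY": "Address",
    "AGE": "Age",
    "PHONE": "Phone_Fax", "FAX": "Phone_Fax", "EMAIL": "Email",
    "SSN": "SSN", "IDNUM": "ID",
}

def to_nut_tag(i2b2_tag):
    """Map a fine-grained i2b2 tag to its generic NUT counterpart.
    Tags outside the mapping (e.g., DATE, PROFESSION) pass through
    unchanged; all string constants here are hypothetical."""
    return I2B2_TO_NUT.get(i2b2_tag, i2b2_tag)
```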
3.3. Annotation Process
Following previous work on the construction of de-identification benchmark corpora (Stubbs and Uzuner, 2015; Deléger et al., 2012), we employ a double-annotation strategy: two annotators read and tag the same documents. In total, 12 non-domain experts annotated the sample of 1260 medical records independently and in parallel. The documents were randomly split into 6 sets, and we randomly assigned a pair of annotators to each set. To ensure that the annotators had a common understanding of the annotation instructions, an evaluation session was held after each pair of annotators completed the first 20 documents. We include the annotation instructions that were provided to the annotators in the online repository of this paper; the instructions are in large parts based on the annotation guidelines of Stubbs and Uzuner (2015). In total, it took 77 hours to double-annotate the entire dataset of 1260 documents, or approximately 3.7 minutes per document.

We measured the inter-annotator agreement (IAA) using entity-level F1 scores; the F-score has been shown to be more suitable for quantifying IAA in sequence-tagging scenarios than measures such as the Kappa score (Deléger et al., 2012). Table 3 shows the IAA per PHI category. Overall, the agreement level is fairly high (0.84). However, we find that location names (i.e., care institutes, hospitals, organizations and internal locations) are often highly ambiguous, which is reflected by the low agreement scores for these categories (between 0.29 and 0.52).
|PHI Tag|Count|Frac. (%)|IAA|
To improve annotation efficiency, we integrated the rule-based de-identification tool DEDUCE (Menger et al., 2018) with our annotation software to pre-annotate each document. This functionality could be activated on a document basis by each annotator. If an annotator used this functionality, they had to review the pre-annotations, correct potential errors and check for missed PHI instances. During the evaluation sessions, annotators mentioned that the existing tool proved helpful when annotating repetitive names, dates and email addresses. Note that this pre-annotation strategy might give DEDUCE a slight advantage. However, the low performance of DEDUCE in the formal benchmark in Section 5 does not reflect this.
After annotation, the main author of this paper reviewed 19,165 annotations and resolved any disagreements between the two annotators to form the gold standard of 17,464 PHI annotations. Table 3 shows the distribution of PHI tags after adjudication. Overall, adjudication was done in a risk-averse manner: if only one annotator identified a piece of text as PHI, we assume that the other annotator missed this potential PHI instance. In addition to the manual adjudication, we performed two automatic checks: (1) we ensured that PHI instances occurring in multiple files received the same PHI tag, and (2) any instances that were tagged in one part of the corpus but not in another were manually reviewed and added to the gold standard. We used the BRAT annotation tool for both annotation and adjudication (Stenetorp et al., 2012).
3.4. Surrogate Generation
As the annotated dataset consists of personally identifying information which is protected by the GDPR, we generate artificial replacements for each of the PHI instances before using the data for the development of de-identification methods. This process is known as surrogate generation, a common practice in the preparation of de-identification corpora (Stubbs et al., 2015b). As surrogate generation will inevitably alter the semantics of the corpus to an extent where it affects the de-identification performance, it is important that this step is done as thoroughly as possible (Yeniterzi et al., 2010). Here, we follow the semi-automatic surrogate generation procedure that has been used to prepare the i2b2/UTHealth shared task corpora. Below, we summarize this procedure and mention the language specific resources we used. We refer the reader to Stubbs et al. (2015b) for a thorough discussion of the method. After running the automatic replacement scripts, we reviewed each of the surrogates to ensure that continuity within a document is preserved and no PHI is leaked into the new dataset.
We adapt the surrogate generation method of Stubbs et al. (2015b) to the Dutch language as follows. A list of the 10,000 most common family names and given names (www.naamkunde.net, accessed 2019-12-09) is used to generate random surrogates for name PHI instances. We replace dates by first parsing the format (e.g., “12 nov. 2018” → “%d %b. %Y”) with a rule-based date parser (github.com/nedap/dateinfer, accessed 2019-12-09), and then randomly shifting all dates within a document by the same amount of years and days into the future. For addresses, we match names of cities, streets, and countries against a dictionary of Dutch locations (openov.nl, accessed 2019-12-09), and then pick random replacements from that dictionary. As Dutch ZIP codes follow a standard format (“1234AB”), their replacement is straightforward. Names of hospitals, care institutes, organizations and internal locations are randomly shuffled within the dataset. PHI instances of type Age are capped at 89 years. Finally, alphanumeric strings such as Phone/FAX, Email, URL/IP, SSN and IDs are replaced by substituting each alphanumeric character with another character of the same class. We manually rewrite Profession and Other tags, as automatic replacement is not applicable.
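The two fully automatic replacement steps, document-consistent date shifting and character-class substitution for alphanumeric identifiers, can be sketched as follows; the offset range for date shifting is our own illustrative choice:

```python
import random
import string
from datetime import date, timedelta

def shift_dates(dates, rng):
    """Shift all dates in one document by the same random offset into
    the future, preserving the intervals between them (a sketch of the
    Stubbs et al. (2015b) procedure; the offset range is assumed)."""
    offset = timedelta(days=rng.randint(1, 365) + 365 * rng.randint(1, 5))
    return [d + offset for d in dates]

def surrogate_id(text, rng):
    """Replace each alphanumeric character with a random character of
    the same class (digit -> digit, letter -> letter, case preserved)."""
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-' or '@'
    return "".join(out)
```

Shifting every date in a document by the same offset keeps internal chronology (e.g., admission before discharge) intact, which matters for downstream analyses.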
4. Methods

This section presents the three de-identification methods and the evaluation procedure.
4.1. Rule-based Method: DEDUCE
DEDUCE is an unsupervised de-identification method specifically developed for Dutch medical records (Menger et al., 2018). It is based on lookup tables, decision rules and fuzzy string matching and has been validated on a corpus of 400 psychiatric nursing notes and treatment plans of a single hospital. Following the authors’ recommendations, we customize the method to include a list of 1200 institutions that are common in our domain. Also, we resolve two incompatibilities between the PHI coding schemes of our dataset and the DEDUCE output. First, as DEDUCE does not distinguish between hospitals, care institutes, organizations and internal locations, we group these four PHI tags under a single Named Location tag. Second, our Name annotations do not include titles (e.g., “Dr.” or “Ms.”). Therefore, titles are stripped from the DEDUCE output.
4.2. Feature-based Method: Conditional Random Field
CRFs and hybrid rule-based systems provided state-of-the-art performance in recent shared tasks (Stubbs et al., 2015a, 2017). Therefore, we implement a CRF approach to contrast with the unsupervised rule-based system. In particular, we re-implement the token-based CRF method by Liu et al. (2015) and re-use a subset of their features (see Table 4); we disregard word-representation features, as Liu et al. (2015) found that they had a negative performance impact. The linear-chain CRF is trained using L-BFGS and elastic net regularization (Zou and Hastie, 2005). Using a validation set, we optimize the two regularization coefficients of the ℓ1 and ℓ2 norms with a random search (250 trials). We use the CRFsuite implementation by Okazaki (2007).
|Feature|Description|
|---|---|
|Bag-of-words (BOW)|Token unigrams, bigrams and trigrams within a window around the current token.|
|POS|Same as above, but with POS n-grams.|
|BOW + POS|Combinations of the previous, current and next token and their POS tags.|
|Sentence|Length in tokens, presence of an end mark such as '.', '?', '!', and whether the sentence contains unmatched brackets.|
|Affixes|Prefixes and suffixes of length 1 to 5.|
|Orthographic|Binary indicators of word shape: is all caps, is capitalized, has capital letters inside, contains a digit, contains punctuation, consists of only ASCII characters.|
|Word Shapes|The abstract shape of a token. For example, "7534-Df" becomes "####-Aa".|
|Named-entity recognition (NER)|NER tag assigned by the spaCy tagger.|
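A minimal sketch of how such token-level features could be extracted for a CRF; the feature names and the context window size are illustrative, not the exact feature set of Liu et al. (2015):

```python
def word_shape(token):
    """Abstract shape: uppercase -> 'A', lowercase -> 'a', digit -> '#';
    e.g. "7534-Df" becomes "####-Aa"."""
    shape = []
    for ch in token:
        if ch.isdigit():
            shape.append("#")
        elif ch.isalpha():
            shape.append("A" if ch.isupper() else "a")
        else:
            shape.append(ch)
    return "".join(shape)

def token_features(tokens, i):
    """Sketch of a subset of the Table 4 features for token i:
    affixes, orthographic indicators, word shape, and a local
    bag-of-words context window."""
    tok = tokens[i]
    feats = {
        "bias": 1.0,
        "word.lower": tok.lower(),
        "shape": word_shape(tok),
        "is_upper": tok.isupper(),
        "is_title": tok.istitle(),
        "has_digit": any(c.isdigit() for c in tok),
        "is_ascii": tok.isascii(),
    }
    for n in range(1, 6):  # prefixes/suffixes of length 1..5
        feats[f"prefix{n}"] = tok[:n]
        feats[f"suffix{n}"] = tok[-n:]
    for offset in (-2, -1, 1, 2):  # unigram context window
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
    return feats
```

Feature dictionaries of this shape can be fed directly to CRFsuite-style sequence labelers, which accept one feature mapping per token.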
4.3. Neural Method: BiLSTM-CRF
To reduce the need for hand-crafted features in traditional CRF-based de-identification, recent work applies neural methods (Liu et al., 2017; Dernoncourt et al., 2017; Khin et al., 2018). Here, we re-implement a BiLSTM-CRF architecture with contextual string embeddings, which has recently been shown to provide state-of-the-art results for sequence labeling tasks (Akbik et al., 2018). Hyperparameters are set to the best-performing configuration in Akbik et al. (2018): we use stochastic gradient descent without momentum and an initial learning rate of 0.1. If the training loss does not decrease for 3 consecutive epochs, the learning rate is halved. Training is stopped when the learning rate falls below a minimum threshold or 150 epochs are reached. Furthermore, the number of hidden layers in the LSTM is set to 1 with 256 recurrent units. We employ locked dropout with a value of 0.5 and use a mini-batch size of 32. With respect to the embedding layer, we use the pre-trained GloVe (English) and fastText (Dutch) embeddings on a word level, and concatenate them with the pre-trained contextualized string embeddings included in Flair (github.com/zalandoresearch/flair, accessed 2019-12-09) (Pennington et al., 2014; Grave et al., 2018; Akbik et al., 2019).
4.4. Preprocessing and Sequence Tagging
We use a common preprocessing routine for all three datasets. For tokenization and sentence segmentation, the spaCy tokenizer is used (spacy.io, accessed 2019-12-09). The POS/NER features of the CRF method are generated by the built-in spaCy models. After sentence segmentation, we tag each token according to the Beginning, Inside, Outside (BIO) scheme. On rare occasions, sequence labeling methods may produce invalid transitions (e.g., an I- tag directly following an O tag). In a post-processing step, we replace invalid I- tags with B- tags (Reimers and Gurevych, 2017).
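This BIO repair step can be implemented in a few lines; a minimal sketch:

```python
def repair_bio(tags):
    """Post-processing described above: replace an I- tag that does not
    continue an entity of the same type with a B- tag."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity = tag[2:]
            # An I- tag is valid only after B-<entity> or I-<entity>.
            if prev not in (f"B-{entity}", f"I-{entity}"):
                tag = "B-" + entity
        fixed.append(tag)
        prev = tag
    return fixed
```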
4.5. Evaluation

The de-identification methods are assessed according to precision, recall and F1 computed on an entity level, the standard evaluation approach for NER systems (Tjong Kim Sang and De Meulder, 2003). In an entity-level evaluation, predicted PHI offsets and types have to match exactly. Following the evaluation of de-identification shared tasks, we use the micro-averaged entity-level F1 score as the primary metric (Stubbs et al., 2015a). De-identification systems are often also evaluated on a less strict token level; as a system that scores high on an entity level will also score high on a token level, we only measure according to the stricter level of evaluation.
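Micro-averaged entity-level scoring reduces to exact set matching over (type, start, end) triples; a minimal sketch:

```python
def entity_level_scores(gold, predicted):
    """Micro-averaged entity-level precision/recall/F1: a prediction
    counts as correct only if PHI type, start offset and end offset
    all match exactly. Entities are (type, start, end) tuples."""
    gold_set, pred_set = set(gold), set(predicted)
    tp = len(gold_set & pred_set)  # exact matches only
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

A prediction with the right type but off-by-one offsets counts as both a false positive and a false negative, which is what makes this stricter than token-level scoring.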
We randomly split our dataset and the nursing notes corpus into training, validation and testing sets with a 60/20/20 ratio. As the i2b2 corpus has a pre-defined test set of 40%, a random set of 20% of the training documents serves as validation data. Finally, we test for statistical significance using two-sided approximate randomization (Yeh, 2000).
|NUT (Dutch)|i2b2 (English)|Nursing Notes (English)|
5. Results

In this section, we first discuss the de-identification results obtained on our Dutch dataset (Section 5.1). Afterwards, we present an error analysis of the best performing method (Section 5.2). We conclude with the benchmark on the English datasets (Section 5.3) and the cross-domain experiments (Section 5.4).
Table 6: Entity-level precision and recall per PHI category on the NUT dataset. Scores are compared between the rule-based tagger DEDUCE (Menger et al., 2018) and the BiLSTM-CRF model. The Named Loc. tag is the union of the 4 specific location tags, which are not individually supported by DEDUCE. Tags are ordered by frequency, with location tags fixed at the bottom.
5.1. De-identification of Dutch Dataset
Both machine learning methods outperform the rule-based system DEDUCE by a large margin (see Table 5). Furthermore, the BiLSTM-CRF provides a substantial improvement of 10 percentage points in recall over the traditional CRF method, while maintaining precision. Overall, the neural method has an entity-level recall of 87.1% and a recall of 95.6% for names, showing that it is operational for many de-identification scenarios. In addition, we make the following observations.
Neural method performs at least as well as the rule-based method. Inspecting the model performance on a PHI-tag level, we observe that the neural method outperforms DEDUCE for all classes of PHI (see Table 6). Only for the Phone and Email categories does the rule-based method have a slightly higher precision. Similarly, we studied the impact of the training set size on de-identification performance. Both machine learning methods outperform DEDUCE even when trained on as little as 10% of the total sentences (see Figure 2). This suggests that in most environments where training data are available (or can be obtained), the machine learning methods are to be preferred.
Rule-based method can provide a “safety net.” It can be observed that DEDUCE performs reasonably well for names, phone numbers, email addresses and URLs (see Table 6). As these PHI instances are likely to directly reveal the identity of an individual, their removal is essential. However, DEDUCE does not generalize beyond the PHI types mentioned above. Especially named locations are non-trivial to capture with a rule-based system as their identification strongly relies on the availability of exhaustive lookup lists. In contrast, the neural method provides a significant improvement for named locations (5.8% vs. 65.9% recall). We assume that word-level and character-level embeddings provide an effective tool to capture these entities.
Initials, IDs and professions are hard to detect. During annotation, we observed a low F1 annotator agreement of 0.46, 0.43, and 0.31 for initials, IDs and professions, respectively. This shows that these PHI types are among the hardest to identify, even for humans (see Table 3). One possible cause for this is that IDs and initials are often hard to discriminate from abbreviations and medical measurements. We observe that the BiLSTM-CRF detects those PHI classes with high precision but low recall. With respect to professions, we find that phrases are often wrongly tagged. For example, colloquial job descriptions (e.g., “works behind the cash desk”) as opposed to the job title (e.g., “cashier”) make it infeasible to tackle this problem with lookup lists, while a machine learner likely requires more training data to capture this PHI.
5.2. Error Analysis on Dutch Dataset
To gain a better understanding of the best performing model and an intuition for its limitations, we conduct a manual error analysis of the false positives (FPs) and false negatives (FNs) produced by the BiLSTM-CRF on the test set. We discuss the error categorization scheme in Section 5.2.1 and present the results in Section 5.2.2.
5.2.1. Error Categorization
We distinguish between two error groups: (1) modeling errors, and (2) annotation/preprocessing errors. We define modeling errors to be problems that can be addressed with different de-identification techniques and additional training data. In contrast, annotation and preprocessing errors are not directly caused by the sequence labeling model, but are issues in the training data or the preprocessing pipeline which need to be addressed manually. Inspired by the classification scheme of Dernoncourt et al. (2017), we consider the following sources of modeling errors:
Abbreviation. PHI instances which are abbreviations or acronyms for names, care institutes and companies. These are hard to detect and can be ambiguous as they are easily confused with medical terms and measurements.
Ambiguity. A human reader may be unable to decide whether a given text fragment is PHI.
Debatable. It can be argued that the token should not have been annotated as PHI.
Prefix. Names of internal locations, organizations and companies are often prefixed with articles (i.e., “de” and “het”). Sometimes, it is unclear whether the prefix is part of the official name or part of the sentence construction. This ambiguity is reflected in the training data which causes the model to inconsistently include or exclude those prefixes.
Common Language. PHI instances consisting of common language are hard to discriminate from the surrounding text.
Other. Remaining modeling errors that do not fall into the categories mentioned above. In those cases, it is not immediately apparent why the misclassification occurs.
Preprocessing errors are categorized as follows:
Missing Annotation. The text fragment is PHI, but was missed during the annotation phase.
Annotation Error. The annotator assigned an invalid entity boundary.
Tokenization Error. The annotated text span could not be split into a compatible token span. Those tokens were marked as “Outside (O)” during BIO tagging.
We consider all error categories to be mutually exclusive.
|FNs|FPs|
5.2.2. Results of Error Analysis
Table 7 summarizes the error analysis results and shows the absolute and relative frequency of each error category. Overall, we find that the majority of modeling errors cannot be easily explained through human inspection (“Other reason” in Table 7). The remaining errors are mainly caused by ambiguous PHI instances and preprocessing errors. In more detail, we make the following observations:
Abbreviations are the second most common cause for modeling errors (13.9% of FNs, 9.7% of FPs). We hypothesize that more training data will likely not in itself help to correctly identify this type of PHI. It is conceivable to design custom features (e.g., based on shape, positioning in a sentence, presence/absence in a medical dictionary) to increase precision. However, it is an open question how recall can be improved.
PHI instances consisting of common language are likely to be wrongly tagged (7.5% FNs, 3.1% FPs). This is caused by the fact that there are insufficient training examples where common language is used to refer to PHI. For example, the organization name in the sentence “Vandaag heb ik Beter Horen gebeld” (Eng: “I called Beter Horen today”) was incorrectly classified as non-PHI. Each individual word, and also the combination of the two words, can be used in different contexts without referring to PHI. However, in this specific context, it is apparent that “Beter Horen” must refer to an organization.
A substantial amount of errors is due to annotation and preprocessing issues. Annotation errors (4.5% FNs, 6.3% FPs) can be resolved by correcting the respective PHI offsets in the gold standard. Tokenization errors (8.7% FNs, 6.9% FPs) need to be fixed through a different preprocessing routine. For example, the annotation <DATE 2016>/<DATE 2017> should have been split into [2016, /, 2017] with BIO tagging [B, O, B]. However, the spaCy tokenizer segmented this text into a single token [2016/2017]. In this case, entity boundaries no longer align with token boundaries, which results in an invalid BIO tagging of [O] for the entire span.
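The misalignment described above occurs whenever character-level entity boundaries do not coincide with token boundaries; a sketch of the alignment check that detects such cases:

```python
def align_entity(entity_start, entity_end, token_spans):
    """Map a character-level entity span to token indices. Returns the
    covered (first, last) token range, or None when the entity
    boundaries fall inside a token (the tokenization errors described
    above), in which case the span would receive an all-'O' BIO
    tagging. `token_spans` holds (start, end) character offsets."""
    covered = [i for i, (s, e) in enumerate(token_spans)
               if s >= entity_start and e <= entity_end]
    if not covered:
        return None
    first, last = covered[0], covered[-1]
    # Entity boundaries must coincide exactly with token boundaries.
    if (token_spans[first][0] != entity_start
            or token_spans[last][1] != entity_end):
        return None
    return first, last
```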
Several false positives are in fact PHI and should have been annotated. The model identifies several PHI instances that were missed during the annotation phase (11.5% of the FPs). Once more, this demonstrates that proper de-identification is an error-prone task for human annotators.
5.3. De-identification of English Datasets
When training and testing both machine learning methods on the English i2b2 and nursing notes datasets, we observe that the BiLSTM-CRF significantly outperforms the CRF in both cases (see Table 5). Similar to our Dutch dataset, the neural method provides an increase of up to 11.2 percentage points in recall (nursing notes) while precision remains relatively stable. This shows that the neural method has the best generalization capabilities, even across languages. More importantly, it does not require the development of domain-specific lookup lists or sophisticated pattern matching rules. To put the results into perspective: the second-highest ranked team in the i2b2 2014 challenge used a sophisticated ensemble combining a CRF with domain-specific rules (Stubbs et al., 2015a). Their system obtained an entity-level F1 score of 0.9124, which is on par with the performance of our neural method, which requires no configuration. We expect that the performance of the neural method would further improve with hyperparameter optimization. Finally, note that both machine learning methods can easily be adapted to a new PHI tagging scheme, whereas rule-based methods are limited to the PHI definition they were developed for.
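Entity-level F1, as used throughout our evaluation, counts a prediction as correct only if both the span and the PHI type match the gold standard exactly. A minimal sketch of this metric (illustrative, not the official i2b2 evaluation script):

```python
def entity_f1(gold, pred):
    """gold/pred: sets of (start, end, phi_type) tuples.
    A true positive requires an exact match of span and type."""
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 4, "DATE"), (10, 15, "NAME")}
pred = {(0, 4, "DATE"), (10, 14, "NAME")}  # boundary off by one character
score = entity_f1(gold, pred)  # the boundary error counts as both an FP and an FN
```

Note how strict the metric is: a single misplaced entity boundary is penalized twice, once as a false positive and once as a false negative.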
5.4. Cross-domain De-identification
In many de-identification scenarios, heterogeneous training data from multiple medical institutes and domains is rarely available. This raises the question of how well a model trained on a homogeneous set of medical records generalizes to records from other medical domains. We trained the three de-identification methods on one domain of Dutch healthcare (e.g., elderly care) and tested each model on the records of the remaining two domains (e.g., disabled care and mental care), following the same training and evaluation procedures described in Section 4.5. Table 8 summarizes the performance of each method on the different tasks.
Again, the neural method consistently outperforms the rule-based and feature-based methods in all three domains, which suggests that it is a sound default choice for de-identification. This is underlined by the fact that the amount of training data is severely limited in this experiment: each domain comprises only 420 documents, of which 20% are reserved for testing. Interestingly, DEDUCE performs rather stably and even outperforms the CRF within the domain of elderly care.
Summary of the transfer learning experiment on our Dutch dataset. Each method is trained on data from one care domain and tested on the other two domains. All scores are micro-averaged entity-level F1.
Given an ideal de-identification method, one would expect the performance on unseen data from a different domain to be similar to the test score obtained on the available (homogeneous) data. Table 9 shows a performance breakdown of the neural method for each of the three testing domains. In 4 out of 6 cases, the test score in a new domain is lower than the score obtained on in-domain data. The largest delta between the observed in-domain test score (disabled care, 0.919 F1) and the performance in the transfer domain (elderly care, 0.698 F1) is 0.221 F1. This raises an important point for de-identification in practice: while the neural method shows the best generalization capabilities of the compared methods, its performance can still be significantly lower when a pre-trained model is applied to a new domain.
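The micro-averaged scores reported in Tables 8 and 9 pool the raw entity counts across test sets before computing F1, rather than averaging per-domain scores. A sketch with made-up counts (not our actual results) illustrates the difference:

```python
def f1_from_counts(tp, fp, fn):
    """Entity-level F1 from pooled true/false positive and false negative counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

# Hypothetical per-domain (tp, fp, fn) counts, NOT our measured results.
domains = {"disabled care": (90, 5, 10), "mental care": (40, 20, 30)}

# Micro average: pool the counts, then compute one F1 score.
tp = sum(c[0] for c in domains.values())
fp = sum(c[1] for c in domains.values())
fn = sum(c[2] for c in domains.values())
micro_f1 = f1_from_counts(tp, fp, fn)

# Macro average: compute F1 per domain, then average the scores.
macro_f1 = sum(f1_from_counts(*c) for c in domains.values()) / len(domains)
```

Micro-averaging weights each entity equally, so domains with more PHI instances dominate the score; macro-averaging would weight each domain equally regardless of its size.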
While the contextual string embeddings used in this paper have been shown to provide state-of-the-art results for NER (Akbik et al., 2018), transformer-based architectures for contextual embeddings have also gained significant attention (e.g., BERT (Devlin et al., 2018)). Benchmarking different types of pre-trained embeddings on the task of de-identification would make an interesting experiment. Furthermore, we observe that the neural method provides strong performance even with limited training data (see Figure 2). It is unclear what large pre-trained embeddings contribute in those scenarios, which warrants an ablation study of different model configurations. We leave the exploration of these ideas to future research.
This paper presents the construction of a novel Dutch dataset and a comparison of state-of-the-art de-identification methods across Dutch and English medical records. Our experiments show the following. (1) An existing rule-based method for the Dutch language does not generalize well to new domains. (2) For an out-of-the-box de-identification method, neural approaches show the best generalization performance across languages and domains. (3) When testing across different domains, a substantial decrease in performance has to be expected, an important consideration when applying de-identification in practice.
There are several directions for future work. Motivated by the limited generalizability of pre-trained models across different domains, transfer learning techniques can provide a way forward; a preliminary study by Lee et al. (2018) shows that they can be beneficial for de-identification. Finally, our experiments show that phrases such as professions are among the most difficult types of information to de-identify. Designing methods that reliably capture this type of information remains an open challenge.
- The MITRE identification scrubber toolkit: design, training, and assessment. I. J. Medical Informatics 79 (12), pp. 849–859.
- Pooled contextualized embeddings for named entity recognition. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 724–728.
- Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649.
- Building gold standard corpora for medical natural language processing tasks. In AMIA 2012, American Medical Informatics Association Annual Symposium, Chicago, Illinois, USA, November 3-7, 2012.
- De-identification of patient notes with recurrent neural networks. JAMIA 24 (3), pp. 596–606.
- BERT: pre-training of deep bidirectional transformers for language understanding. Computing Research Repository arXiv:1810.04805.
- Computer-assisted de-identification of free text in the MIMIC II database. In Computers in Cardiology, 2004, pp. 341–344.
- Overview of the health search and data mining (HSDM 2020) workshop. In Proceedings of the Thirteenth ACM International Conference on Web Search and Data Mining, WSDM ’20, New York, NY, USA.
- Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (Data Protection Directive). Official Journal of the European Union L119, pp. 1–88.
- Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
- Evaluation of a Deidentification (De-Id) Software Engine to Share Pathology Reports and Clinical Documents for Research. American Journal of Clinical Pathology 121 (2), pp. 176–186.
- CRFs based de-identification of medical records. Journal of Biomedical Informatics 58 (S), pp. S39–S46.
- Health Insurance Portability and Accountability Act. Public Law 104-191.
- A deep learning architecture for de-identification of patient notes: implementation and evaluation. Computing Research Repository arXiv:1810.01570.
- Transfer learning for named-entity recognition with neural networks. In Proceedings of the 11th Language Resources and Evaluation Conference, Miyazaki, Japan, pp. 4470–4473.
- Automatic de-identification of electronic medical records using token-level and character-level conditional random fields. Journal of Biomedical Informatics 58, pp. S47–S52.
- De-identification of clinical notes via recurrent neural network and conditional random field. Journal of Biomedical Informatics 75 (S), pp. S34–S42.
- DEDUCE: a pattern matching method for automatic de-identification of Dutch medical text. Telematics and Informatics 35 (4), pp. 727–736.
- De-identification of unstructured clinical data for patient privacy protection. In Medical Data Privacy Handbook, A. Gkoulalas-Divanis and G. Loukides (Eds.), pp. 697–716.
- Automated de-identification of free-text medical records. BMC Med. Inf. & Decision Making 8, pp. 32.
- Clinical natural language processing in languages other than English: opportunities and challenges. Journal of Biomedical Semantics 9 (1), pp. 12:1–12:13.
- GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
- Optimal hyperparameters for deep LSTM-networks for sequence labeling tasks. Computing Research Repository arXiv:1707.06799.
- De-identification of German medical admission notes. Studies in Health Technology and Informatics 253, pp. 165–169.
- De-identification of clinical free text in Dutch with limited training data: a case study. In Proceedings of the Workshop on NLP for Medicine and Biology associated with RANLP 2013, pp. 18–23.
- BRAT: a web-based tool for NLP-assisted text annotation. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’12, Stroudsburg, PA, USA, pp. 102–107.
- De-identification of psychiatric intake records: overview of 2016 CEGS N-GRID shared tasks track 1. Journal of Biomedical Informatics 75, pp. S4–S18.
- Automated systems for the de-identification of longitudinal clinical narratives: overview of 2014 i2b2/UTHealth shared task track 1. Journal of Biomedical Informatics 58, pp. S11–S19.
- Challenges in synthesizing surrogate PHI in narrative EMRs. In Medical Data Privacy Handbook, A. Gkoulalas-Divanis and G. Loukides (Eds.), pp. 717–735.
- Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics 58, pp. S20–S29.
- De-identification of Dutch Medical Text. In 2nd Healthcare Text Analytics Conference (HealTAC 2019).
- Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, Stroudsburg, PA, USA, pp. 142–147.
- More accurate tests for the statistical significance of result differences. In Proceedings of the 18th Conference on Computational Linguistics - Volume 2, COLING ’00, Stroudsburg, PA, USA, pp. 947–953.
- Effects of personal identifier resynthesis on clinical text de-identification. JAMIA 17 (2), pp. 159–168.
- Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B 67, pp. 301–320.