Development and validation of a natural language processing algorithm to pseudonymize documents in the context of a clinical data warehouse

03/23/2023
by Xavier Tannier, et al.

This study addresses the de-identification of clinical reports, which is needed to give researchers access to data while preserving patient privacy. It highlights the difficulty of sharing tools and resources in this domain and presents the experience of the Greater Paris University Hospitals (AP-HP) in implementing systematic pseudonymization of text documents from its Clinical Data Warehouse. We annotated a corpus of clinical documents with 12 types of identifying entities and built a hybrid system that merges the results of a deep learning model with those of manual rules. The system achieves an overall F1-score of 0.99. We discuss implementation choices and present experiments that quantify the effort involved in such a task, covering dataset size, document types, language models, and rule addition. We share guidelines and code under a 3-Clause BSD license.
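
To make the idea of a hybrid system concrete, here is a minimal sketch of how rule-based and model-based entity spans might be merged and then replaced. It is not the authors' released code: the `Span` type, the two regular expressions, the entity labels, the "longest span wins" merge policy, and the placeholder replacement are all assumptions chosen for illustration, and the model predictions are simply passed in as a list.

```python
import re
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # entity type, e.g. "DATE"
    source: str  # "rule" or "model"


# Hypothetical rules; a real system would need many more patterns.
RULES = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b0\d(?:[ .]?\d{2}){4}\b"),
}


def rule_spans(text: str) -> List[Span]:
    """Return every span matched by the hand-written rules."""
    return [
        Span(m.start(), m.end(), label, "rule")
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model: List[Span], rules: List[Span]) -> List[Span]:
    """Union of both sources; on overlap keep the longer span.

    Assumed policy: ties on length go to the rule-based span.
    """
    candidates = sorted(
        model + rules,
        key=lambda s: (-(s.end - s.start), s.source != "rule", s.start),
    )
    kept: List[Span] = []
    for s in candidates:
        # Keep a candidate only if it overlaps nothing already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def replace_spans(text: str, spans: List[Span]) -> str:
    """Replace each span (sorted, non-overlapping) with a <LABEL> tag."""
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(f"<{s.label}>")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)
```

In this sketch the entities are masked with a type tag for simplicity; a pseudonymization pipeline would typically substitute a plausible surrogate value of the same type instead.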

research
06/12/2023

EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing

The utilization of clinical reports for various secondary purposes, incl...
research
09/21/2023

Improving VTE Identification through Adaptive NLP Model Selection and Clinical Expert Rule-based Classifier from Radiology Reports

Rapid and accurate identification of Venous thromboembolism (VTE), a sev...
research
04/30/2019

FastContext: an efficient and scalable implementation of the ConText algorithm

Objective: To develop and evaluate FastContext, an efficient, scalable i...
research
09/16/2022

De-Identification of French Unstructured Clinical Notes for Machine Learning Tasks

Unstructured textual data are at the heart of health systems: liaison le...
research
07/02/2020

NLNDE: The Neither-Language-Nor-Domain-Experts' Way of Spanish Medical Document De-Identification

Natural language processing has huge potential in the medical domain whi...
research
06/27/2020

Normalizador Neural de Datas e Endereços

Documents of any kind present a wide variety of date and address formats...