Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?

04/15/2021
by Eric Lehman, et al.

Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so), coupled with their utility, motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and the conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated "attacks" may succeed in doing so. To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release.


