In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

05/18/2023
by Yuxin Xiao, et al.

Data sharing is crucial for open science and reproducible research, but legally sharing clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often performed by commercial and open-source systems built on machine learning algorithms. While these systems achieve compelling results on average, the variation in their performance across demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To this end, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further show that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution: fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems that serve all demographic groups fairly.
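To make the evaluation protocol concrete, the sketch below shows the kind of per-group measurement the abstract describes: names drawn from demographically grouped name sets are inserted into clinical note templates, a de-identification system is run on each note, and recall on the inserted names is compared across groups. The `deidentify` callable, the `{name}` template placeholder, and the grouped name sets are illustrative assumptions, not the authors' released code or data.

```python
from typing import Callable, Dict, List

def name_recall(
    deidentify: Callable[[str], str],   # system under test: returns the note with detected PHI masked
    templates: List[str],               # e.g. "Patient {name} was admitted with chest pain."
    name_sets: Dict[str, List[str]],    # demographic group label -> list of names
) -> Dict[str, float]:
    """Fraction of inserted names that the system masks, per demographic group."""
    recall = {}
    for group, names in name_sets.items():
        hits, total = 0, 0
        for template in templates:
            for name in names:
                note = template.format(name=name)
                masked = deidentify(note)
                total += 1
                # If the name no longer appears, the system de-identified it.
                if name not in masked:
                    hits += 1
        recall[group] = hits / total
    return recall

# Usage (hypothetical system and name sets):
# gaps = name_recall(my_system, templates,
#                    {"group_A": ["Alice", ...], "group_B": ["Aaliyah", ...]})
# A large difference between groups indicates demographic bias in name detection.
```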
