
Auditing Algorithmic Fairness in Machine Learning for Health with Severity-Based LOGAN

11/16/2022
by   Anaelia Ovalle, et al.

Auditing machine learning-based (ML) healthcare tools for bias is critical to preventing patient harm, especially in communities that disproportionately face health inequities. General frameworks are becoming increasingly available to measure ML fairness gaps between groups. However, ML for health (ML4H) auditing principles call for a contextual, patient-centered approach to model assessment. Therefore, ML auditing tools must be (1) better aligned with ML4H auditing principles and (2) able to illuminate and characterize communities vulnerable to the most harm. To address this gap, we propose supplementing ML4H auditing frameworks with SLOGAN (patient Severity-based LOcal Group biAs detectioN), an automatic tool for capturing local biases in a clinical prediction task. SLOGAN adapts an existing tool, LOGAN (LOcal Group biAs detectioN), by contextualizing group bias detection in patient illness severity and past medical history. We investigate and compare SLOGAN's bias detection capabilities to LOGAN and other clustering techniques across patient subgroups in the MIMIC-III dataset. On average, SLOGAN identifies larger fairness disparities than LOGAN in over 75% of patient groups while maintaining clustering quality. Furthermore, in a diabetes case study, health disparity literature corroborates the characterizations of the most biased clusters identified by SLOGAN. Our results contribute to the broader discussion of how machine learning biases may perpetuate existing healthcare disparities.



Introduction

Fairness auditing frameworks are necessary for operationalizing machine learning algorithms in healthcare (ML4H). In particular, they must identify and characterize biases (Chen et al., 2021). Ongoing directives to promote health equity must also translate to these spaces, with care placed on those historically vulnerable to the most harm, such as communities with chronic illnesses and racial and ethnic minorities (Oala et al., 2020; Joszt, 2022). To do this, these communities must be prioritized when evaluating fairness in ML4H (Rajkomar et al., 2018; Chen et al., 2021; Röösli et al., 2022).

Commercialized auditing tools are increasingly leveraged for bias assessment in ML4H algorithms (Oala et al., 2020; Kumar et al., 2020). However, we argue that applying out-of-the-box auditing tools without a clear patient-centric design is not enough. Existing auditing tools must align with the health ethics principles that guide a framework's operationalization. In the ML4H auditing literature, this means the tool must be able to detect locally biased patient subgroups when monitoring the fairness of an ML4H model throughout its lifecycle (de Hond et al., 2022). To monitor disparities with health equity in mind, researchers must also engage critically with the broader sociotechnical context surrounding the use of ML auditing tools in healthcare (Pfohl et al., 2021).

This work addresses the gap by devising a patient-centric ML auditing tool called SLOGAN. SLOGAN adapts LOGAN (Zhao and Chang, 2020), an unsupervised algorithm that uses contextual word embeddings (Devlin et al., 2018) to cluster local groups of bias indicated by model performance differences. To better align auditing with measures of effective care planning and therapeutic intervention (Katz et al., 2016), SLOGAN identifies local group biases in clinical prediction tasks by leveraging patient risk stratification. Previous medical history is also commonly used for understanding health inequities through social, cultural, and structural barriers the patient experiences (Brennan Ramirez et al., 2008). Therefore, SLOGAN characterizes these local biases using patients’ electronic healthcare records (EHR) histories.

Experiments on in-hospital mortality prediction demonstrate how SLOGAN effectively identifies local group biases. We audit the model across 12 MIMIC-III patient subgroups. We then provide a case study to further examine fairness differences in patients with chronic illnesses such as Diabetes Mellitus. Results indicate that (1) SLOGAN, on average, captures larger biases than LOGAN, and (2) the identified biases align with existing health disparity literature.

Background and Related Work

Algorithmic Auditing in ML for Healthcare

Obermeyer et al. (2019) audit a commercialized ML4H algorithm by dissecting observed disparities between patient risk and overall health cost. The authors call for the continued probing of health inequity in these clinical systems. Likewise, Wiegand et al. (2019); Pfohl et al. (2021); Siala and Wang (2022); de Hond et al. (2022) create guidelines for operationalizing transparent assessments of ML4H models. Auditing frameworks such as Aequitas (http://aequitas.dssg.io/) and AIFairness360 (https://aif360.mybluemix.net/) are operationalized for this purpose (Oala et al., 2020). The tools provide reports relevant to protected groups and fairness metrics, indicating unfairness through preset disparity ranges.

Measuring Health Equity Barriers

Intersectional social identities are related to a patient’s health outcomes (McGinnis et al., 2002; Katz et al., 2018). Therefore, measuring health equity in ML requires understanding a patient beyond their illness. In practice, this can include focusing on populations with histories of a significant illness burden or examining bias from the lens of social determinants of health (SDOH). Fairness literature has also dictated a need to measure biases from multidimensional perspectives (Hanna et al., 2020). Capturing social context beyond protected attributes is helpful for this cause. SDOH, such as unequal access to healthcare, language, stigma, racism, and social community, are underlying contributing factors to health inequities (Aday, 1994; Peek et al., 2007; Brennan Ramirez et al., 2008).

Fairness and Local Bias Detection

LOGAN (Zhao and Chang, 2020), a method to detect local bias, adapts K-Means to cluster BERT embeddings while maximizing a bias metric within each cluster. LOGAN consists of a 2-part objective: a K-Means clustering objective ($J_{\text{KMeans}}$) and an objective to maximize a bias metric ($J_{\text{bias}}$, e.g., the performance gap between 2 groups) within each respective cluster:

$$J(C) = J_{\text{KMeans}}(C) + \lambda \, J_{\text{bias}}(C) \qquad (1)$$

where $\lambda$ is a tunable hyperparameter that controls the tradeoff between the two objectives and indicates how strongly to cluster with respect to group performance differences. We define our bias metric as the model performance disparity between 2 groups, measured by accuracy. However, detecting biases by identifying similar contextual representations is not enough; the task must be adapted to the clinical domain to audit with health equity in mind. One way to do this is to incorporate domain-specific information. For example, severity scores stratify patients based on their immediate needs and help clinicians decide how to allocate resources effectively (Ferreira et al., 2001). Therefore, we build off of LOGAN and create a tool that translates to the medical setting by mindfully using this information.
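As a concrete illustration of the bias metric just described, the following is a minimal sketch (not the authors' released code) that computes the accuracy gap between two groups inside one cluster; all variable and function names are illustrative.

```python
import numpy as np

def cluster_bias(y_true, y_pred, group, members):
    """Accuracy gap between two groups inside one cluster.

    y_true, y_pred : binary labels and model predictions (1-D arrays)
    group          : binary group indicator (e.g., 1 = has diabetes)
    members        : boolean mask selecting the cluster's instances
    """
    correct = (y_true == y_pred)
    in_a = members & (group == 1)   # group A instances in this cluster
    in_b = members & (group == 0)   # group B instances in this cluster
    if not in_a.any() or not in_b.any():
        return 0.0                  # gap undefined when a group is absent
    return correct[in_a].mean() - correct[in_b].mean()  # signed gap; |gap| is the bias magnitude
```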

Methodology

Clinical NLP Pretrained Embeddings

Several BERT models are publicly available for use in the clinical setting, including various implementations of ClinicalBERT (Alsentzer et al., 2019; Huang et al., 2019). We leverage the variant of ClinicalBERT from Zhang et al. (2020), which extends ClinicalBERT with improvements such as whole-word masking.

Metric        K-Means  LOGAN  SLOGAN  # of MIMIC-III Attributes
Inertia (↓)   1.0      0.991  0.981   7/12 (58%)
SCR (↑)       15.3     22.9   30.1    12/12 (100%)
SIR (↑)       15.3     18.4   23.4    7/12 (58%)
|Bias| (↑)    12.5     21.5   34.2    9/12 (75%)
Table 1:

Average values for 12 MIMIC-III attributes across models and evaluation metrics. SCR, SIR, and |Bias| in %. |Bias| is the average absolute model performance difference in biased clusters. Bold is the best performance per row. Right-most column is number of MIMIC-III attributes where SLOGAN performs best. Arrows indicate desired direction of a number.

Automatic Bias Detection

To create a patient-centric bias detection tool, we encourage SLOGAN to identify large bias gaps while accounting for similarity in patient severity. SLOGAN measures local biases in a model using patient-specific features and contextual embeddings of patient history for in-hospital mortality prediction. We do this via a patient similarity constraint. A variety of patient severity scores such as OASIS, SAPS II, and SOFA are available for use (Le Gall et al., 1993; Jones et al., 2009; Johnson et al., 2013). Following health literature and clinician advice, we select the SOFA acuity score. However, depending on clinician needs, a different constraint may be used (e.g., ICD-9 codes). Extending Eq.  (1), this results in the following optimization problem:

$$J(C) = J_{\text{KMeans}}(C) + \lambda_1 \, J_{\text{bias}}(C) + \lambda_2 \, J_{\text{sev}}(C) \qquad (2)$$

where $J_{\text{sev}}$ is added to encourage the model to group patients with similar acute severity. $\lambda_1$ and $\lambda_2$ are hyperparameters that control the tradeoff between the objectives of grouping by patient similarity and clustering by local bias.

$$J_{\text{sev}}(C) = \sum_{k} \sum_{i \in C_k} \big(s_i - \bar{s}_{C_k}\big)^2 \qquad (3)$$

where $s_i$ is patient $i$'s SOFA score and $\bar{s}_{C_k}$ is the mean SOFA score of cluster $C_k$. $\lambda_1$ and $\lambda_2$ are tuned via a grid search, and we choose the combination that identifies the largest local group biases (Appendix Table 4).
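To make Eqs. (1)-(3) concrete, the sketch below scores a candidate cluster assignment by combining K-Means inertia, the within-cluster accuracy gap, and within-cluster SOFA dispersion. The function name, the squared-deviation form of the severity term, the sign convention, and the default weights are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def slogan_objective(X, sofa, y_true, y_pred, group, labels, centroids,
                     lam1=-30.0, lam2=50.0):
    """Score one cluster assignment in the spirit of Eq. (2):
    K-Means inertia + lam1 * within-cluster bias + lam2 * SOFA dispersion.
    A negative lam1 rewards clusters with larger group performance gaps;
    a positive lam2 penalizes clusters that mix very different severities.
    """
    correct = (y_true == y_pred)
    total = 0.0
    for k, mu in enumerate(centroids):
        members = labels == k
        if not members.any():
            continue
        inertia = ((X[members] - mu) ** 2).sum()              # K-Means term
        in_a, in_b = members & (group == 1), members & (group == 0)
        bias = 0.0
        if in_a.any() and in_b.any():
            bias = abs(correct[in_a].mean() - correct[in_b].mean())   # accuracy gap
        severity = ((sofa[members] - sofa[members].mean()) ** 2).sum()  # SOFA dispersion
        total += inertia + lam1 * bias + lam2 * severity
    return total
```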

We flag a cluster as biased when it exhibits at least a 10% difference in accuracy between groups and at most a SOFA score difference of 0.8. We chose these thresholds by splitting the data, creating bootstrap estimates 1000 times, and adding three standard deviations.

We compare SLOGAN to LOGAN and K-Means across several metrics. To measure the utility of the clusters found, we examine the ratio of biased clusters found (SCR) and the ratio of instances falling in those clusters (SIR). We use inertia to measure clustering quality, as it reflects how tightly the data cluster around their respective centroids. Finally, we compare each algorithm's inertia to a baseline K-Means model normalized to 1.0.
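The following is a minimal sketch of how SCR, SIR, and the average |Bias| of flagged clusters could be computed from a cluster assignment, using the 10% accuracy and 0.8 SOFA thresholds above. Interpreting the SOFA criterion as a between-group mean difference within each cluster is our assumption, and the helper is illustrative rather than the authors' implementation.

```python
import numpy as np

def audit_metrics(labels, y_true, y_pred, group, sofa,
                  acc_gap_min=0.10, sofa_gap_max=0.8):
    """Flag biased clusters and summarize SCR, SIR, and mean |Bias|."""
    correct = (y_true == y_pred)
    clusters = np.unique(labels)
    flagged, gaps, flagged_n = [], [], 0
    for k in clusters:
        m = labels == k
        in_a, in_b = m & (group == 1), m & (group == 0)
        if not in_a.any() or not in_b.any():
            continue
        acc_gap = abs(correct[in_a].mean() - correct[in_b].mean())
        sofa_gap = abs(sofa[in_a].mean() - sofa[in_b].mean())
        if acc_gap >= acc_gap_min and sofa_gap <= sofa_gap_max:
            flagged.append(k)          # cluster counts toward SCR
            gaps.append(acc_gap)       # contributes to average |Bias|
            flagged_n += m.sum()       # its instances count toward SIR
    scr = len(flagged) / len(clusters)
    sir = flagged_n / len(labels)
    mean_bias = float(np.mean(gaps)) if gaps else 0.0
    return {"SCR": scr, "SIR": sir, "|Bias|": mean_bias, "clusters": flagged}
```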

Data and Setup

In order to maximize reproducibility, we perform experiments with the same patient cohorts defined in the benchmark dataset from the MIMIC-III clinical database (Johnson et al., 2016; Harutyunyan et al., 2019). Following Sun et al. (2022), to understand how BERT represents social determinants of health and captures possible stigmatizing language in the data, we extracted the history of present illness, past medical history, social history, and family history from physician notes, nursing notes, and discharge summaries (Marmot, 2005). We employed MedSpacy (Eyre et al., in press) to extract any information related to a patient's social determinants of health. After preprocessing, this translated into a 70% train, 15% validation, and 15% test split of 1581, 393, and 309 patients, respectively. No patient appeared across the splits. Analyses were conducted across self-identified ethnicity, sex, insurance type, English-speaking status, presence of chronic illness, presence of diabetes (type I and II), social determinants of health, and negative patient descriptors to measure stigma. We also explored creating cross-sectional groups (Appendix Table 1).

We used SLOGAN to audit a fully connected neural network from Zhang et al. (2020) that predicts in-hospital mortality, a common MIMIC-III benchmarking task (Harutyunyan et al., 2019). A patient who passed away within 48 hours of their ICU stay is assigned the label 1; otherwise the patient is assigned the label 0. Each patient note in the test set was encoded and concatenated with gender, OASIS, SAPS II, and SOFA scores, and age. To provide a rich contextual representation of patient notes to SLOGAN, encodings consisted of the concatenated last four layers of ClinicalBERT (Devlin et al., 2018). The embeddings encoded 512 tokens, the maximum sequence length for BERT. We followed the best hyperparameters of the model and chose the classification threshold that provides at least 80% accuracy on the validation set.
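A minimal sketch of the note-encoding step is shown below, assuming the public emilyalsentzer/Bio_ClinicalBERT checkpoint stands in for the ClinicalBERT variant used in the paper; the concatenation of the last four hidden layers and the 512-token limit follow the description above, while the mean pooling over tokens is our assumption.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: a public ClinicalBERT checkpoint stands in for the variant used in the paper.
MODEL_NAME = "emilyalsentzer/Bio_ClinicalBERT"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def encode_note(note_text: str) -> torch.Tensor:
    """Return a note embedding from the concatenated last four BERT layers."""
    inputs = tokenizer(note_text, truncation=True, max_length=512,
                       padding="max_length", return_tensors="pt")
    outputs = model(**inputs)
    # hidden_states: tuple of (input embeddings + 12 layers), each (1, 512, 768)
    last_four = torch.cat(outputs.hidden_states[-4:], dim=-1)  # (1, 512, 4*768)
    return last_four.mean(dim=1).squeeze(0)                    # mean-pool tokens -> (3072,)

# Illustrative: concatenate the note embedding with tabular severity features.
# note_vec = encode_note("History of present illness: ...")
# patient_vec = torch.cat([note_vec, torch.tensor([gender, oasis, sapsii, sofa, age])])
```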

Results

Aggregate Analysis

We assessed SLOGAN’s local bias clustering abilities and quality across 12 attributes in MIMIC-III, including demographic variables such as ethnicity and gender. The model was compared to K-Means and LOGAN using the SCR, SIR, |Bias|, and Inertia measurements introduced in the previous sections. We report these results in Table 1. For most attributes, SLOGAN was the best at identifying groups with fairness gaps: identified groups contained more instances and larger biases, while maintaining clustering quality. In particular, SLOGAN identified the most and largest local group biases in at least 9/12 (75%) of attributes, as measured by SCR and |Bias|, respectively. Compared with LOGAN and K-Means, SLOGAN found the highest ratio of biased instances within biased clusters (SIR) in 7/12 (58%) of MIMIC-III attributes. We report audits across all attributes in Appendix Table 2.

Case Study: Diabetes Mellitus

Has Diabetes Method Acc-Yes Acc-No |Bias|
Global 75.0 84.1 9.1
K-Means 55.0 75.0 20.0
LOGAN 60.0 88.0 28.0
SLOGAN 54.5 91.7 37.1
Table 2: Bias detection (%) for in-hospital mortality task. Global indicates global bias. “Yes” indicates patient with diabetes. |Bias| is the max absolute model performance difference in biased clusters. SLOGAN identifies local biases greater than global bias observed in the data (bold).
Has Diabetes Method Inertia SCR SIR |Bias|
K-Means 1.00 33.3 27.1 14.2
LOGAN 1.003 25.0 16.9 25.0
SLOGAN 1.12 25.0 15.4 28.6
Table 3: Comparison under diabetes attribute. SCR and SIR are respectively the % of biased clusters and % of biased instances. |Bias|(%) is the average absolute bias score for the biased clusters. SLOGAN finds the largest bias (bold).

Cluster Analysis

Diabetes is one of the most common and costly chronic conditions worldwide and is accompanied by serious comorbidities (Ceriello et al., 2012). To further study this, we used SLOGAN to assess local group biases on the HAS DIABETES attribute and identified fairness gaps in agreement with the health disparity literature.

We report the accuracy and maximum absolute performance differences across the biased clusters identified by K-Means, LOGAN, and SLOGAN in Table 2. The overall performance difference between patients that do and do not have diabetes was 9.1%. K-Means and LOGAN identified local groups with larger performance discrepancies (20% and 28.1%, respectively). Notably, SLOGAN performed the best at identifying a local region with the largest performance gap (37.1%). We also report the SCR, SIR, |Bias|, and Inertia in Table 3. Results indicate that SLOGAN found groups with a larger average bias magnitude than K-Means and LOGAN. While LOGAN and SLOGAN identified the same ratio of biased clusters (25.0%), SLOGAN identified the largest average local bias (28.6%) with a small tradeoff in inertia (Appendix Figure 1).

To more carefully examine clusters formed by SLOGAN, we show respective performance deviations in Figure 1. We found that SLOGAN identified fairness gaps documented in health literature. Two clusters exhibited a large local bias towards patients without diabetes, clusters 1 and 4. We analyzed differences in cluster characteristics between the most and least biased cluster. The most biased cluster, cluster 4, contained 38% more patients with chronic illnesses besides diabetes, with 33.3% suffering from chronic illnesses besides diabetes or hypertension. We then compared cluster 4 to all other clusters. Again, we found that it contained the largest percentage of (1) patients (62.5%) with chronic illnesses besides diabetes and (2) patients with chronic illnesses besides diabetes and hypertension (25%). Cluster 4 also had fewer patients with private insurance than the least biased cluster and the lowest percentage of English-speaking patients (4.6%) in the entire dataset (Appendix Table 3). Notably, these differences in disease burden, insurance, and language align with existing research indicating how populations with the largest health disparities often suffer from a larger burden of disease and may experience significant structural language barriers (Flores, 2005; Peek et al., 2007).

Bias Interpretation with Topic Modeling

Severe diabetes complications may result in various forms of deadly infections and respiratory issues (Joshi et al., 1999; Muller et al., 2005; De Santi et al., 2017). Given the in-hospital mortality task, we asked whether indications of severe diabetes complications were present when using SLOGAN. To do this, we ran Latent Dirichlet Allocation (LDA) topic modeling (Blei et al., 2003) within identified SLOGAN clusters. We detail the preprocessing steps in the appendix. Table 4 lists the top 20 topic words for the most and least biased clusters. SLOGAN grouped patients with histories indicating deadly infections and respiratory issues in the most biased cluster. Terms included “sputum” (thick respiratory secretion), “Acinetobacter” (bacteria that can live in respiratory secretions), and “Vanco” (vancomycin, used to treat infections).

Social determinants of health also correlate to effective self-management of diabetes (Clark and Utz, 2014; Adu et al., 2019). Therefore we also examined differences in social determinants of health between the least and most biased clusters. While LDA cannot determine the directionality of SDOH impact, the top 20 terms are among the most important when forming the cluster’s topic distribution. In the least biased cluster, top words included terms around the community such as ‘home’, ‘offspring’, ‘children’, and ‘sibling’. However, in the most biased cluster, just 1 of the 20 terms, ‘parent’, reflected possible existing social support.

Figure 1: Performance differences for HAS DIABETES attribute. Furthest right red box shows global bias, while SLOGAN finds a local area of much higher bias at cluster 4.

Discussion

We developed SLOGAN as a framework to audit an ML4H task by identifying areas of patient severity-aware local bias. Results indicated that SLOGAN captures more, higher-quality biased clusters across several subgroups than the baseline models, K-Means and LOGAN. To illustrate how to use SLOGAN in a clinical context, we conducted a case study that used SLOGAN to identify clusters of local bias in diabetic patients. We found that the observed biases aligned with existing health literature; in particular, the cluster with the largest local bias was also the cluster with the largest disease burden. This result demonstrates a need to further examine and repeat these experiments across patient cohorts and performance metrics. Interesting future work includes asking how models encode vulnerable communities in their representations and whether health disparities consistently propagate into model biases.

In practice, SLOGAN can be used to determine biased clusters for review before model deployment in a healthcare setting. The tool may also track how biases shift due to changes in the data or across operationalization in different hospital networks. Furthermore, patient-centric local bias detection can supplement ML4H model auditing. With this information, ML researchers and clinicians can use auditing report cards to decide on the next steps for inclusive model development.

Most biased (40.0) parent, given, recent, vanco, treat, fever, acinetobacter, ecg, negative, intubated, disorder, bottles, clozaril, complete, sputum, past, started, ed, found, admitted
Least biased (0.2) noted, past, recent, home, given, due, pain, two, offspring, mild, chest, initially, without, blood, vancomycin, children, shortness_breath, sibling, admitted, started
Table 4: Top 20 topic words in the most and least biased clusters using SLOGAN for HAS DIABETES attribute. Number is the bias score (%) of that cluster.

Ethical Statement & Limitations

Our analysis used MIMIC-III data, an open, deidentified clinical dataset (https://physionet.org/content/mimiciii/1.4/#files). Only credentialed researchers who fulfilled all training requirements and abided by the data use agreement accessed the data. We reviewed the data and clinical notes a second time to confirm the removal of any patient-related information, including location, age, name, date, or hospital.

In practice, further interdisciplinary discussion on how SLOGAN can best be integrated into the ML4H auditing pipeline is welcomed. While we do not analyze the factors influencing model fairness, we encourage this future work. Furthermore, it is important to note that the absence of flagged bias clusters is not an indicator of a total absence of risk for downstream unfair outcomes.

Appendix A Appendix

LDA

LDA is run using the NLTK and gensim packages (Loper and Bird, 2002; Řehůřek et al., 2011). Unigrams and bigrams are generated using gensim.phrases with min_count=3 and threshold=5. The LDA is run in gensim with random_state=100, update_every=1, chunksize=100, and passes=100. To achieve better topic modeling, words like child, son, and daughter are tokenized as 'offspring', and words pertaining to father or mother are replaced with 'parent'. Words such as hypertension and hypertensive are replaced with 'hypert'; similarly, words such as hypotension and hypotensive are replaced with 'hypot'.
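A minimal sketch of this per-cluster LDA setup is shown below, using the gensim and NLTK settings listed above; the number of topics and the choice to report the first topic are assumptions, since the paper does not state them here.

```python
from gensim import corpora
from gensim.models import LdaModel, Phrases
from gensim.models.phrases import Phraser
from nltk.tokenize import word_tokenize  # requires nltk.download('punkt')

def lda_for_cluster(notes, num_topics=5):
    """Fit LDA on one SLOGAN cluster's notes; num_topics is an assumption."""
    # The paper additionally normalizes kinship/condition terms (e.g., son/daughter
    # -> 'offspring', hypertension -> 'hypert') before tokenization; not shown here.
    tokenized = [word_tokenize(doc.lower()) for doc in notes]
    bigram = Phraser(Phrases(tokenized, min_count=3, threshold=5))  # unigrams + bigrams
    docs = [bigram[doc] for doc in tokenized]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   random_state=100, update_every=1, chunksize=100, passes=100)
    return lda.show_topic(0, topn=20)  # top 20 (word, weight) pairs of one topic
```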

Negative Patient Descriptors

We explored the SDOH dimension of stigma in clinical notes by extracting the negative patient descriptors identified in Sun et al. (2022) and outline the results in Appendix Table 5. However, further preprocessing beyond the use of regular expressions is needed to reduce false positive rates.

Code

We will publicly release the code in an easily accessible repository upon review of this paper.

Figure 1: t-SNE results with circled most biased cluster for HAS DIABETES attribute
Group Percent (%)
Has Negative Descriptor 8.86
Has Diabetes 35.43
Has Chronic Illness 88.0
Medicaid Insurance 7.71
Medicare Insurance 60.86
Private Insurance 28.0
Speaks English 86.57
Assigned Male at Birth (AMAB) 56.29
Assigned Female at Birth (AFAB) 43.71
Self-identifies White 75.14
Self-Identifies Black 13.43
AFAB + Self-Identifies Black 8.86
Table 1: Percent of each attribute in the MIMIC-III data

Has Diabetes
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.33  0.27  0.14
LOGAN    1.00     0.25  0.17  0.25
SLOGAN   1.12     0.25  0.15  0.29

Has Negative Descriptor
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.00  0.00  0.00
LOGAN    0.88     0.20  0.19  0.20
SLOGAN   0.85     0.20  0.19  0.37

Has Chronic Illness
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.17  0.25  0.17
LOGAN    1.15     0.40  0.32  0.40
SLOGAN   0.89     0.50  0.47  0.23

Is Medicaid Insurance
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.40  0.46  0.23
LOGAN    0.99     0.20  0.25  0.20
SLOGAN   0.94     0.20  0.11  0.76

Is Medicare Insurance
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.13  0.13  0.21
LOGAN    0.91     0.22  0.22  0.22
SLOGAN   0.87     0.22  0.16  0.21

Is Private Insurance
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.22  0.20  0.12
LOGAN    1.18     0.14  0.12  0.14
SLOGAN   1.12     0.14  0.10  0.26

Is English Speaker
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.00  0.00  0.00
LOGAN    1.02     0.17  0.17  0.17
SLOGAN   0.91     0.43  0.44  0.31

Assigned Male at Birth (AMAB)
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.00  0.00  0.00
LOGAN    1.00     0.11  0.09  0.11
SLOGAN   1.03     0.25  0.13  0.41

Assigned Female at Birth (AFAB)
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.00  0.00  0.00
LOGAN    1.00     0.11  0.09  0.11
SLOGAN   1.05     0.13  0.04  0.39

Self-identifies White
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.14  0.13  0.14
LOGAN    0.86     0.38  0.37  0.38
SLOGAN   0.98     0.40  0.41  0.26

Self-identifies Black
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.40  0.28  0.20
LOGAN    0.91     0.20  0.10  0.20
SLOGAN   1.02     0.20  0.10  0.27

Self-identifies Black + AFAB
Method   Inertia  SCR   SIR   |Bias|
K-Means  1.00     0.20  0.13  0.28
LOGAN    1.00     0.20  0.13  0.20
SLOGAN   0.99     0.60  0.49  0.35

Table 2: Comparison between K-Means, LOGAN, and SLOGAN under each attribute type. SCR and SIR are respectively the ratio of biased clusters and ratio of biased instances. |Bias| is the averaged absolute bias score for these biased clusters. Results not shown in %.
Group (%)
Private Insurance -100.0
Medicaid Insurance 11.1
Medicaid Insurance 51.5
Self-Identifies White 36.5
Self-Identifies Black N/A
Self-Identifies Hispanic N/A
Self-Identifies Asian N/A
Self-Identifies Other 11.1
English Speaker -1.6
Assigned Male at Birth (AMAB) -38.3
Has Chronic Illness, Not Diabetes 37.8
Has Chronic Illness, Not Diabetes or Hypertension 33.3
Hypertensive 11.1
Has Acute Illness 27.8
Table 3: Percentage differences (Δ, in %) in characteristics between the most and least biased clusters for the HAS DIABETES attribute. A positive number means the most biased cluster has more instances of the attribute than the least biased cluster. N/A indicates division by zero.
Group λ1 λ2
Has Diabetes -30 50
Has Negative Descriptor -20 0
Has Chronic Illness -30 50
Medicaid Insurance -70 30
Medicare Insurance -50 0
Private Insurance -70 40
Speaks English -30 0
Assigned Male at Birth (AMAB) -10 60
Assigned Female at Birth (AFAB) 0 70
Self-Identifies White -30 20
Self-Identifies Black -20 60
AFAB + Self-Identifies Black -10 60
Table 4: Selected hyperparameters λ1 and λ2 per attribute after searching combinations of λ1 between -100 and 0 and λ2 between 0 and 100, respectively.
Most biased (38.7) denies, rehab, treat, pain, well,
sputum, transferred, hx, valve,
sent, course, cxr, chest pain, one,
episodes, mild, cough, floor,
worsening, disease, tobacco
Least biased (0.67) pain, given, denies, admit, home,
time, last, well, hip, past, started,
disease, found, noted, transferred,
liver, developed, treat,
symptoms, nausea, blood
Table 5: Most and Least Biased LDA top 20 words for HAS NEGATIVE DESCRIPTOR patient descriptor. Number in parentheses is the bias score (%) of that cluster.
Most biased (32.7) disease, cardiac, lives, received,
given, admit, denies, parent, family,
cath, symptoms, cancer, positive,
diabetes mellitus, type, past, time,
alcohol, cad, recently, ct
Least biased (3.4) abdominal pain, denies, pain, started,
chest pain, chronic, cough, disease,
transferred, past, hyperlipidemia, patient,
time, given, hypert, recent, cardiac, ros,
shortness breath, complaints, found
Table 6: Top 20 topic words in the most and least biased cluster using SLOGAN under IS ENGLISH SPEAKER. Number in parentheses is the bias score (%) of that cluster.

References

  • L. A. Aday (1994) Health status of vulnerable populations. Annual Review of Public Health 15 (1), pp. 487–509.
  • M. D. Adu, U. H. Malabu, A. E. Malau-Aduli, and B. S. Malau-Aduli (2019) Enablers and barriers to effective diabetes self-management: a multi-national investigation. PLoS ONE 14 (6), pp. e0217771.
  • E. Alsentzer, J. Murphy, W. Boag, W. Weng, D. Jindi, T. Naumann, and M. McDermott (2019) Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022.
  • L. K. Brennan Ramirez, E. A. Baker, and M. Metzler (2008) Promoting health equity: a resource to help communities address social determinants of health.
  • A. Ceriello, L. Barkai, J. S. Christiansen, L. Czupryniak, R. Gomis, K. Harno, B. Kulzer, J. Ludvigsson, Z. Némethyová, D. Owens, et al. (2012) Diabetes as a case study of chronic disease management with a personalized approach: the role of a structured feedback loop. Diabetes Research and Clinical Practice 98 (1), pp. 5–10.
  • I. Y. Chen, E. Pierson, S. Rose, S. Joshi, K. Ferryman, and M. Ghassemi (2021) Ethical machine learning in healthcare. Annual Review of Biomedical Data Science 4, pp. 123–144.
  • M. L. Clark and S. W. Utz (2014) Social determinants of type 2 diabetes and health in the United States. World Journal of Diabetes 5 (3), pp. 296.
  • A. A. de Hond, A. M. Leeuwenberg, L. Hooft, I. M. Kant, S. W. Nijman, H. J. van Os, J. J. Aardoom, T. Debray, E. Schuit, M. van Smeden, et al. (2022) Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. npj Digital Medicine 5 (1), pp. 1–13.
  • F. De Santi, G. Zoppini, F. Locatelli, E. Finocchio, V. Cappa, M. Dauriz, and G. Verlato (2017) Type 2 diabetes is associated with an increased prevalence of respiratory symptoms as compared to the general population. BMC Pulmonary Medicine 17 (1), pp. 1–8.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • H. Eyre, A. B. Chapman, K. S. Peterson, J. Shi, P. R. Alba, M. M. Jones, T. L. Box, S. L. DuVall, and O. V. Patterson (in press) Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. In AMIA Annual Symposium Proceedings 2021.
  • F. L. Ferreira, D. P. Bota, A. Bross, C. Mélot, and J. Vincent (2001) Serial evaluation of the SOFA score to predict outcome in critically ill patients. JAMA 286 (14), pp. 1754–1758.
  • G. Flores (2005) The impact of medical interpreter services on the quality of health care: a systematic review. Medical Care Research and Review 62 (3), pp. 255–299.
  • A. Hanna, E. Denton, A. Smart, and J. Smith-Loud (2020) Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 501–512.
  • H. Harutyunyan, H. Khachatrian, D. C. Kale, G. Ver Steeg, and A. Galstyan (2019) Multitask learning and benchmarking with clinical time series data. Scientific Data 6 (1), pp. 1–18.
  • K. Huang, J. Altosaar, and R. Ranganath (2019) ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
  • A. E. Johnson, A. A. Kramer, and G. D. Clifford (2013) A new severity of illness scale using a subset of acute physiology and chronic health evaluation data elements shows comparable predictive accuracy. Critical Care Medicine 41 (7), pp. 1711–1718.
  • A. E. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3 (1), pp. 1–9.
  • A. E. Jones, S. Trzeciak, and J. A. Kline (2009) The Sequential Organ Failure Assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation. Critical Care Medicine 37 (5), pp. 1649.
  • N. Joshi, G. M. Caputo, M. R. Weitekamp, and A. Karchmer (1999) Infections in patients with diabetes mellitus. New England Journal of Medicine 341 (25), pp. 1906–1912.
  • L. Joszt (2022) 5 vulnerable populations in healthcare.
  • A. Katz, D. Chateau, J. E. Enns, J. Valdivia, C. Taylor, R. Walld, and S. McCulloch (2018) Association of the social determinants of health with quality of primary care. The Annals of Family Medicine 16 (3), pp. 217–224.
  • J. N. Katz, M. Minder, B. Olenchock, S. Price, M. Goldfarb, J. B. Washam, C. F. Barnett, L. K. Newby, and S. van Diepen (2016) The genesis, maturation, and future of critical care cardiology. Journal of the American College of Cardiology 68 (1), pp. 67–79.
  • A. Kumar, A. Ramachandran, A. De Unanue, C. Sung, J. Walsh, J. Schneider, J. Ridgway, S. M. Schuette, J. Lauritsen, and R. Ghani (2020) A machine learning system for retaining patients in HIV care. arXiv preprint arXiv:2006.04944.
  • J. Le Gall, S. Lemeshow, and F. Saulnier (1993) A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA 270 (24), pp. 2957–2963.
  • E. Loper and S. Bird (2002) NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028.
  • M. Marmot (2005) Social determinants of health inequalities. The Lancet 365 (9464), pp. 1099–1104.
  • J. M. McGinnis, P. Williams-Russo, and J. R. Knickman (2002) The case for more active policy attention to health promotion. Health Affairs 21 (2), pp. 78–93.
  • L. Muller, K. Gorter, E. Hak, W. Goudzwaard, F. Schellevis, A. Hoepelman, and G. Rutten (2005) Increased risk of common infections in patients with type 1 and type 2 diabetes mellitus. Clinical Infectious Diseases 41 (3), pp. 281–288.
  • L. Oala, J. Fehr, L. Gilli, P. Balachandran, A. W. Leite, S. Calderon-Ramirez, D. X. Li, G. Nobis, E. A. M. Alvarado, G. Jaramillo-Gutierrez, et al. (2020) ML4H auditing: from paper to practice. In Machine Learning for Health, pp. 280–317.
  • Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan (2019) Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 (6464), pp. 447–453.
  • M. E. Peek, A. Cargill, and E. S. Huang (2007) Diabetes health disparities. Medical Care Research and Review 64 (5_suppl), pp. 101S–156S.
  • S. R. Pfohl, A. Foryciarz, and N. H. Shah (2021) An empirical characterization of fair machine learning for clinical risk prediction. Journal of Biomedical Informatics 113, pp. 103621.
  • A. Rajkomar, M. Hardt, M. D. Howell, G. Corrado, and M. H. Chin (2018) Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine 169 (12), pp. 866–872.
  • R. Řehůřek, P. Sojka, et al. (2011) Gensim—statistical semantics in Python. Retrieved from gensim.org.
  • E. Röösli, S. Bozkurt, and T. Hernandez-Boussard (2022) Peeking into a black box: the fairness and generalizability of a MIMIC-III benchmarking model. Scientific Data 9 (1), pp. 1–13.
  • H. Siala and Y. Wang (2022) SHIFTing artificial intelligence to be responsible in healthcare: a systematic review. Social Science & Medicine, pp. 114782.
  • M. Sun, T. Oliwa, M. E. Peek, and E. L. Tung (2022) Negative patient descriptors: documenting racial bias in the electronic health record. Health Affairs, pp. 10–1377.
  • T. Wiegand, R. Krishnamurthy, M. Kuglitsch, N. Lee, S. Pujari, M. Salathé, M. Wenzel, and S. Xu (2019) WHO and ITU establish benchmarking process for artificial intelligence in health. The Lancet 394 (10192), pp. 9–11.
  • H. Zhang, A. X. Lu, M. Abdalla, M. McDermott, and M. Ghassemi (2020) Hurtful words: quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 110–120.
  • J. Zhao and K. Chang (2020) LOGAN: local group bias detection by clustering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1968–1977.