Fairness auditing frameworks are necessary for operationalizing machine learning algorithms in healthcare (ML4H). In particular, they must identify and characterize biases (Chen et al., 2021). Ongoing directives to promote health equity must also translate to these spaces, with care placed on those historically vulnerable to the most harm, such as communities with chronic illnesses and racial and ethnic minorities (Oala et al., 2020; Joszt, 2022). To this end, these communities must be prioritized when evaluating fairness in ML4H (Rajkomar et al., 2018; Chen et al., 2021; Röösli et al., 2022).
Commercialized auditing tools are increasingly leveraged for bias assessment in ML4H algorithms (Oala et al., 2020; Kumar et al., 2020). However, we argue that applying out-of-the-box auditing tools without a clear patient-centric design is not enough. Existing auditing tools must align with the health ethics principles that guide a framework's operationalization. Following the ML4H auditing literature, this means a tool must be able to detect locally biased patient subgroups when monitoring the fairness of ML4H throughout its lifecycle (de Hond et al., 2022). To monitor disparities with health equity in mind, researchers must also engage critically with the broader sociotechnical context surrounding the use of ML auditing tools in healthcare (Pfohl et al., 2021).
This work addresses this gap by devising a patient-centric ML auditing tool called SLOGAN. SLOGAN adapts LOGAN (Zhao and Chang, 2020), an unsupervised algorithm that uses contextual word embeddings (Devlin et al., 2018) to cluster local groups of bias indicated by model performance differences. To better align auditing with measures of effective care planning and therapeutic intervention (Katz et al., 2016), SLOGAN identifies local group biases in clinical prediction tasks by leveraging patient risk stratification. Previous medical history is also commonly used to understand health inequities through the social, cultural, and structural barriers a patient experiences (Brennan Ramirez et al., 2008). Therefore, SLOGAN characterizes these local biases using patients' electronic health record (EHR) histories.
Experiments on in-hospital mortality prediction demonstrate how SLOGAN effectively identifies local group biases. We audit the model across 12 MIMIC-III patient subgroups. We then provide a case study to further examine fairness differences in patients with chronic illnesses such as Diabetes Mellitus. Results indicate that (1) SLOGAN, on average, captures larger biases than LOGAN, and (2) the identified biases align with existing health disparity literature.
Background and Related Work
Algorithmic Auditing in ML for Healthcare
Obermeyer et al. (2019) audit a commercialized ML4H algorithm by dissecting observed disparities between patient risk and overall health cost. The authors call for continued probing of health inequity in these clinical systems. Likewise, Wiegand et al. (2019), Pfohl et al. (2021), Siala and Wang (2022), and de Hond et al. (2022) create guidelines for operationalizing transparent assessments of ML4H models. Auditing frameworks such as Aequitas (http://aequitas.dssg.io/) and AI Fairness 360 (https://aif360.mybluemix.net/) are operationalized for this purpose (Oala et al., 2020). These tools provide reports on protected groups and fairness metrics, indicating unfairness through preset disparity ranges.
Measuring Health Equity Barriers
Intersectional social identities are related to a patient's health outcomes (McGinnis et al., 2002; Katz et al., 2018). Therefore, measuring health equity in ML requires understanding a patient beyond their illness. In practice, this can include focusing on populations with histories of significant illness burden or examining bias through the lens of social determinants of health (SDOH). The fairness literature has also identified a need to measure biases from multidimensional perspectives (Hanna et al., 2020), and capturing social context beyond protected attributes helps with this. SDOH, such as unequal access to healthcare, language, stigma, racism, and social community, are underlying contributing factors to health inequities (Aday, 1994; Peek et al., 2007; Brennan Ramirez et al., 2008).
Fairness and Local Bias Detection
LOGAN (Zhao and Chang, 2020), a method to detect local bias, adapts K-Means to cluster BERT embeddings while maximizing a bias metric within each cluster. LOGAN consists of a two-part objective: a K-Means clustering objective ($\mathcal{L}_{KM}$) and an objective that maximizes a bias metric ($\mathcal{L}_{bias}$, e.g., the performance gap between two groups) within each respective cluster:

$$\min \; \mathcal{L}_{KM} - \lambda \, \mathcal{L}_{bias} \qquad (1)$$
Here, $\lambda$ is a tunable hyperparameter that controls the tradeoff between the two objectives; it indicates how strongly to cluster with respect to group performance differences. We define our bias metric as the model performance disparity between two groups, measured by accuracy. However, detecting biases by identifying similar contextual representations is not enough: the task must be adapted to the clinical domain to audit with health equity in mind. One way to do this is to incorporate domain-specific information. For example, severity scores stratify patients based on their immediate needs and help clinicians decide how to allocate resources effectively (Ferreira et al., 2001). We therefore build on LOGAN and create a tool that translates to the medical setting by mindfully using this information.
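As an illustrative sketch (not the authors' released code), the two-part objective above can be written as a single score over a candidate clustering, assuming binary group membership and per-instance correctness indicators; the function and argument names are ours:

```python
import numpy as np

def logan_objective(X, labels, centroids, groups, correct, lam=0.5):
    """Score a clustering under a LOGAN-style objective (illustrative).

    X:         (n, d) contextual embeddings
    labels:    (n,) cluster assignment per instance
    centroids: (k, d) cluster centroids
    groups:    (n,) binary protected-group membership
    correct:   (n,) 1 if the audited model's prediction was correct, else 0
    lam:       tradeoff between clustering quality and bias maximization
    """
    # K-Means term: within-cluster sum of squared distances
    km = sum(((X[labels == c] - centroids[c]) ** 2).sum()
             for c in range(len(centroids)))

    # Bias term: per-cluster absolute accuracy gap between the two groups
    bias = 0.0
    for c in range(len(centroids)):
        mask = labels == c
        g0 = mask & (groups == 0)
        g1 = mask & (groups == 1)
        if g0.any() and g1.any():
            bias += abs(correct[g0].mean() - correct[g1].mean())

    # Minimizing this score clusters tightly while maximizing the bias metric
    return km - lam * bias
```

A smaller score indicates a clustering that is both compact and concentrates group performance differences inside clusters.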
Clinical NLP Pretrained Embeddings
Several BERT models are publicly available for use in the clinical setting. These include various implementations of ClinicalBERT (Alsentzer et al., 2019; Huang et al., 2019). We proceed with leveraging a variant of ClinicalBERT from Zhang et al. (2020) as this is an extension of ClinicalBERT with improvements such as whole-word masking.
| Metric | K-Means | LOGAN | SLOGAN | # of MIMIC-III Attributes |
|---|---|---|---|---|
| Inertia (↓) | 1.0 | 0.991 | **0.981** | 7/12 (58%) |
| SCR (↑) | 15.3 | 22.9 | **30.1** | 12/12 (100%) |
| SIR (↑) | 15.3 | 18.4 | **23.4** | 7/12 (58%) |
| \|Bias\| (↑) | 12.5 | 21.5 | **34.2** | 9/12 (75%) |

Table 1: Average values for 12 MIMIC-III attributes across models and evaluation metrics. SCR, SIR, and |Bias| are in %. |Bias| is the average absolute model performance difference in biased clusters. Bold is the best performance per row. The right-most column is the number of MIMIC-III attributes where SLOGAN performs best. Arrows indicate the desired direction of each metric.
Automatic Bias Detection
To create a patient-centric bias detection tool, we encourage SLOGAN to identify large bias gaps while accounting for similarity in patient severity. SLOGAN measures local biases in a model using patient-specific features and contextual embeddings of patient history for in-hospital mortality prediction. We do this via a patient similarity constraint. A variety of patient severity scores, such as OASIS, SAPS II, and SOFA, are available (Le Gall et al., 1993; Jones et al., 2009; Johnson et al., 2013). Following the health literature and clinician advice, we select the SOFA acuity score; depending on clinician needs, a different constraint may be used (e.g., ICD-9 codes). Extending Eq. (1), this results in the following optimization problem:

$$\min \; \mathcal{L}_{KM} - \lambda \, \mathcal{L}_{bias} + \beta \, \mathcal{L}_{sim} \qquad (2)$$

where $\mathcal{L}_{sim}$ is added to encourage the model to group patients with similar acute severity. $\lambda$ and $\beta$ are hyperparameters that control the tradeoff between the objectives of grouping by patient similarity and clustering by local bias.
$\lambda$ and $\beta$ are tuned via a grid search, and we choose the combination that identifies the largest local group biases (Appendix Table 4).
We define a cluster as biased if it has at least a 10% difference in accuracy between groups and at most a 0.8 difference in average SOFA score. We compare SLOGAN to LOGAN and K-Means across three metrics. To measure the utility of the clusters found, we examine the ratio of biased clusters found (SCR) and the ratio of instances in those clusters (SIR). We use inertia to measure clustering quality, as it reflects how well the data cluster around their respective centroids; each algorithm's inertia is reported relative to a baseline K-Means model normalized to 1.0.
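A minimal sketch of how such an audit could be scored, assuming binary groups and the thresholds above (the function and argument names are ours, not SLOGAN's API):

```python
import numpy as np

def audit_metrics(labels, groups, correct, sofa,
                  acc_gap=0.10, sofa_gap=0.8):
    """Flag biased clusters and compute SCR, SIR, and mean |Bias| (sketch).

    A cluster is flagged when its between-group accuracy gap is >= acc_gap
    and its between-group mean SOFA difference is <= sofa_gap.
    Returns (SCR, SIR, mean_abs_bias), each as a fraction in [0, 1].
    """
    clusters = np.unique(labels)
    flagged, flagged_n, gaps = 0, 0, []
    for c in clusters:
        m = labels == c
        g0, g1 = m & (groups == 0), m & (groups == 1)
        if not (g0.any() and g1.any()):
            continue  # gap undefined if a group is absent from the cluster
        gap = abs(correct[g0].mean() - correct[g1].mean())
        sev = abs(sofa[g0].mean() - sofa[g1].mean())
        if gap >= acc_gap and sev <= sofa_gap:
            flagged += 1
            flagged_n += int(m.sum())
            gaps.append(gap)
    scr = flagged / len(clusters)        # ratio of biased clusters
    sir = flagged_n / len(labels)        # ratio of instances in biased clusters
    mean_bias = float(np.mean(gaps)) if gaps else 0.0
    return scr, sir, mean_bias
```

The SOFA condition keeps flagged clusters clinically comparable: a large accuracy gap only counts as local bias when the compared patients have similar acute severity.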
Data and Setup
To maximize reproducibility, we perform experiments with the same patient cohorts defined in the benchmark dataset from the MIMIC-III clinical database (Johnson et al., 2016; Harutyunyan et al., 2019). Following Sun et al. (2022), to understand how BERT represents social determinants of health and captures possible stigmatizing language in the data, we extracted the history of present illness, past medical history, social history, and family history across physician notes, nursing notes, and discharge summaries (Marmot, 2005). We employed MedSpacy (Eyre et al., 2021) to extract any information related to a patient's social determinants of health. After preprocessing, this translated into a 70% train, 15% validation, and 15% test split of 1581, 393, and 309 patients, respectively. No patient appeared across splits. Analyses were conducted across self-identified ethnicity, sex, insurance type, English-speaking status, presence of chronic illness, presence of diabetes (types I and II), social determinants of health, and negative patient descriptors to measure stigma. We also explored creating cross-sectional groups (Appendix Table 1).
We used SLOGAN to audit a fully connected neural network from Zhang et al. (2020) trained to predict in-hospital mortality, a common MIMIC-III benchmarking task (Harutyunyan et al., 2019). A patient who died within 48 hours of their ICU stay is assigned the label 1; otherwise, the patient is assigned the label 0. Each patient note in the test set was encoded and concatenated with gender, OASIS, SAPS II, SOFA scores, and age. To provide a rich contextual representation of patient notes to SLOGAN, encodings consisted of the concatenated last four layers of ClinicalBERT (Devlin et al., 2018). The embeddings encoded 512 tokens, the maximum input length for BERT. We followed the best hyperparameters of the model and chose the classification threshold that provides at least 80% accuracy on the validation set.
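The feature construction step might look as follows. The mean pooling over tokens is our assumption (the paper does not state the pooling strategy), and the helper name is hypothetical:

```python
import numpy as np

def build_patient_vector(hidden_states, tabular):
    """Build one patient representation (sketch; pooling choice is ours).

    hidden_states: list of per-layer arrays, each (seq_len, hidden_dim),
                   e.g. from a BERT encoder run with output_hidden_states=True
    tabular:       1-D array of patient features (gender, OASIS, SAPS II,
                   SOFA, age)
    """
    # Concatenate the last four layers along the hidden dimension, then
    # mean-pool over the (up to 512) tokens -- the pooling is an assumption.
    emb = np.concatenate(hidden_states[-4:], axis=-1).mean(axis=0)
    # Append the tabular patient features to the pooled text embedding.
    return np.concatenate([emb, tabular])
```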
We assessed SLOGAN's local bias clustering ability and quality across 12 attributes in MIMIC-III, including demographic variables such as ethnicity and gender. The model was compared to K-Means and LOGAN using the SCR, SIR, |Bias|, and inertia measurements introduced in the previous sections. We report these results in Table 1. For most attributes, SLOGAN was best at identifying groups with fairness gaps: the identified groups contained more instances and larger biases while maintaining clustering quality. In particular, SLOGAN identified the most and largest local group biases in at least 9/12 (75%) of attributes, as measured by SCR and |Bias|, respectively. Compared to LOGAN and K-Means, SLOGAN also found the highest ratio of biased instances within biased clusters (SIR) in 7/12 (58%) of MIMIC-III attributes. We report audits across all attributes in Appendix Table 2.
Case Study: Diabetes Mellitus
Diabetes is one of the most common and costly chronic conditions worldwide and is accompanied by serious comorbidities (Ceriello et al., 2012). To study this further, we used SLOGAN to assess local group biases on the HAS DIABETES attribute and identified fairness gaps in agreement with the health literature.
We report the accuracy and maximum absolute performance differences across the biased clusters identified by K-Means, LOGAN, and SLOGAN in Table 2. The overall performance difference between patients who do and do not have diabetes was 9.1%. K-Means and LOGAN identified local groups with larger performance discrepancies (20% and 28.1%, respectively). Notably, SLOGAN performed best, identifying a local region with the largest performance gap (37.1%). We also report the SCR, SIR, |Bias|, and inertia in Table 3. Results indicate that SLOGAN found groups with a larger average bias magnitude than K-Means and LOGAN. While LOGAN and SLOGAN identified the same ratio of biased clusters (25.0%), SLOGAN identified the largest local bias region (28.6%) with a small tradeoff in inertia (Appendix Figure 1).
To more carefully examine clusters formed by SLOGAN, we show respective performance deviations in Figure 1. We found that SLOGAN identified fairness gaps documented in health literature. Two clusters exhibited a large local bias towards patients without diabetes, clusters 1 and 4. We analyzed differences in cluster characteristics between the most and least biased cluster. The most biased cluster, cluster 4, contained 38% more patients with chronic illnesses besides diabetes, with 33.3% suffering from chronic illnesses besides diabetes or hypertension. We then compared cluster 4 to all other clusters. Again, we found that it contained the largest percentage of (1) patients (62.5%) with chronic illnesses besides diabetes and (2) patients with chronic illnesses besides diabetes and hypertension (25%). Cluster 4 also had fewer patients with private insurance than the least biased cluster and the lowest percentage of English-speaking patients (4.6%) in the entire dataset (Appendix Table 3). Notably, these differences in disease burden, insurance, and language align with existing research indicating how populations with the largest health disparities often suffer from a larger burden of disease and may experience significant structural language barriers (Flores, 2005; Peek et al., 2007).
Bias Interpretation with Topic Modeling
Severe diabetes complications may result in various forms of deadly infections and respiratory issues (Joshi et al., 1999; Muller et al., 2005; De Santi et al., 2017). Given the in-hospital mortality task, we asked whether indications of severe diabetes complications were present in the clusters identified by SLOGAN. To answer this, we ran Latent Dirichlet Allocation (LDA) topic modeling (Blei et al., 2003) within the identified SLOGAN clusters; we detail the preprocessing steps in the appendix. Table 4 lists the top 20 topic words for the most and least biased clusters. SLOGAN grouped patients with histories indicating deadly infections and respiratory issues in the most biased cluster. Terms included “sputum” (thick respiratory secretion), “Acinetobacter” (bacteria that can live in respiratory secretions), and “vanco” (vancomycin, used to treat infections).
Social determinants of health also correlate with effective self-management of diabetes (Clark and Utz, 2014; Adu et al., 2019). Therefore, we also examined differences in social determinants of health between the least and most biased clusters. While LDA cannot determine the directionality of SDOH impact, the top 20 terms are among the most important in forming a cluster's topic distribution. In the least biased cluster, top words included community-related terms such as 'home', 'offspring', 'children', and 'sibling'. In the most biased cluster, however, just 1 of the 20 terms, 'parent', reflected possible existing social support.
We developed SLOGAN as a framework to audit an ML4H task by identifying areas of patient severity-aware local bias. Results indicated that SLOGAN captures more and higher-quality clusters across several subgroups than the baseline models, K-Means and LOGAN. To illustrate how to use SLOGAN in a clinical context, we conducted a case study that used SLOGAN to identify clusters of local bias in diabetic patients. We found that the observed biases aligned with existing health literature; in particular, the cluster with the largest local bias was also the cluster with the largest disease burden. This result demonstrates a need to further examine and repeat these experiments across patient cohorts and performance metrics. Interesting future work includes asking how models encode vulnerable communities in their representations and whether health disparities consistently propagate into model biases.
In practice, SLOGAN can be used to determine biased clusters for review before model deployment in a healthcare setting. The tool may also track how biases shift due to changes in the data or across operationalization in different hospital networks. Furthermore, patient-centric local bias detection can supplement ML4H model auditing. With this information, ML researchers and clinicians can use auditing report cards to decide on the next steps for inclusive model development.
| Cluster (\|Bias\| %) | Top 20 topic words |
|---|---|
| Most biased (40.0) | parent, given, recent, vanco, treat, fever, acinetobacter, ecg, negative, intubated, disorder, bottles, clozaril, complete, sputum, past, started, ed, found, admitted |
| Least biased (0.2) | noted, past, recent, home, given, due, pain, two, offspring, mild, chest, initially, without, blood, vancomycin, children, shortness_breath, sibling, admitted, started |
Ethical Statement & Limitations
Our analysis used MIMIC-III, an open, deidentified clinical dataset. Only credentialed researchers who fulfilled all training requirements and abided by the data use agreement accessed the data (https://physionet.org/content/mimiciii/1.4/#files). We reviewed the data and clinical notes a second time to confirm the removal of any patient-related information, including location, age, name, date, or hospital.
In practice, further interdisciplinary discussion on how SLOGAN can best be integrated into the ML4H auditing pipeline is welcomed. While we do not analyze the factors influencing model fairness, we encourage this future work. Furthermore, it is important to note that the absence of flagged bias clusters is not an indicator of a total absence of risk for downstream unfair outcomes.
Appendix A
LDA is run using the NLTK and gensim packages (Loper and Bird, 2002; Řehůřek et al., 2011). Unigrams and bigrams are generated using gensim.models.phrases with min_count=3 and threshold=5. The LDA model is run in gensim with random_state=100, update_every=1, chunksize=100, and passes=100. To achieve better topic modeling, words like child, son, and daughter are tokenized as 'offspring', and words pertaining to father or mother are replaced with 'parent'. Similarly, words such as hypertension and hypertensive are replaced with 'hypert', and words such as hypotension and hypotensive are replaced with 'hypot'.
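The normalization above can be sketched as a simple token mapping (the exact token lists in the original implementation may differ):

```python
# Token normalization applied before topic modeling (sketch of the mapping
# described above; assumes lowercased comparison).
NORMALIZE = {
    "child": "offspring", "son": "offspring", "daughter": "offspring",
    "father": "parent", "mother": "parent",
    "hypertension": "hypert", "hypertensive": "hypert",
    "hypotension": "hypot", "hypotensive": "hypot",
}

def normalize_tokens(tokens):
    """Map family and condition terms onto shared tokens for LDA."""
    return [NORMALIZE.get(t.lower(), t.lower()) for t in tokens]
```

Collapsing these variants onto shared tokens lets LDA treat, e.g., "son" and "daughter" as the same social-support signal when forming topics.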
Negative Patient Descriptors
We will publicly release the code in an easily accessible repository upon review of this paper.
| Attribute | % |
|---|---|
| Has Negative Descriptor | 8.86 |
| Has Chronic Illness | 88.0 |
| Assigned Male at Birth (AMAB) | 56.29 |
| Assigned Female at Birth (AFAB) | 43.71 |
| AFAB + Self-Identifies Black | 8.86 |
| Attribute | Difference (%) |
|---|---|
| Assigned Male at Birth (AMAB) | -38.3 |
| Has Chronic Illness, Not Diabetes | 37.8 |
| Has Chronic Illness, Not Diabetes or Hypertension | 33.3 |
| Has Acute Illness | 27.8 |
| Has Negative Descriptor | -20 | 0 |
| Has Chronic Illness | -30 | 50 |
| Assigned Male at Birth (AMAB) | -10 | 60 |
| Assigned Female at Birth (AFAB) | 0 | 70 |
| AFAB + Self-Identifies Black | -10 | 60 |
| Cluster (\|Bias\| %) | Top topic words |
|---|---|
| Most biased (38.7) | denies, rehab, treat, pain, well, sputum, transferred, hx, valve, sent, course, cxr, chest pain, one, episodes, mild, cough, floor, worsening, disease, tobacco |
| Least biased (0.67) | pain, given, denies, admit, home, time, last, well, hip, past, started, disease, found, noted, transferred, liver, developed, treat, symptoms, nausea, blood |

| Cluster (\|Bias\| %) | Top topic words |
|---|---|
| Most biased (32.7) | disease, cardiac, lives, received, given, admit, denies, parent, family, cath, symptoms, cancer, positive, diabetes mellitus, type, past, time, alcohol, cad, recently, ct |
| Least biased (3.4) | abdominal pain, denies, pain, started, chest pain, chronic, cough, disease, transferred, past, hyperlipidemia, patient, time, given, hypert, recent, cardiac, ros, shortness breath, complaints, found |
References

- Health status of vulnerable populations. Annual Review of Public Health 15(1), pp. 487–509.
- Enablers and barriers to effective diabetes self-management: a multi-national investigation. PLoS ONE 14(6), e0217771.
- Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, Minneapolis, Minnesota, USA, pp. 72–78.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan), pp. 993–1022.
- Promoting health equity: a resource to help communities address social determinants of health.
- Diabetes as a case study of chronic disease management with a personalized approach: the role of a structured feedback loop. Diabetes Research and Clinical Practice 98(1), pp. 5–10.
- Ethical machine learning in healthcare. Annual Review of Biomedical Data Science 4, pp. 123–144.
- Social determinants of type 2 diabetes and health in the United States. World Journal of Diabetes 5(3), pp. 296.
- Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. npj Digital Medicine 5(1), pp. 1–13.
- Type 2 diabetes is associated with an increased prevalence of respiratory symptoms as compared to the general population. BMC Pulmonary Medicine 17(1), pp. 1–8.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. In AMIA Annual Symposium Proceedings 2021.
- Serial evaluation of the SOFA score to predict outcome in critically ill patients. JAMA 286(14), pp. 1754–1758.
- The impact of medical interpreter services on the quality of health care: a systematic review. Medical Care Research and Review 62(3), pp. 255–299.
- Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 501–512.
- Multitask learning and benchmarking with clinical time series data. Scientific Data 6(1), pp. 1–18.
- ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
- A new severity of illness scale using a subset of Acute Physiology and Chronic Health Evaluation data elements shows comparable predictive accuracy. Critical Care Medicine 41(7), pp. 1711–1718.
- MIMIC-III, a freely accessible critical care database. Scientific Data 3(1), pp. 1–9.
- The Sequential Organ Failure Assessment score for predicting outcome in patients with severe sepsis and evidence of hypoperfusion at the time of emergency department presentation. Critical Care Medicine 37(5), pp. 1649.
- Infections in patients with diabetes mellitus. New England Journal of Medicine 341(25), pp. 1906–1912.
- 5 vulnerable populations in healthcare.
- Association of the social determinants of health with quality of primary care. The Annals of Family Medicine 16(3), pp. 217–224.
- The genesis, maturation, and future of critical care cardiology. Journal of the American College of Cardiology 68(1), pp. 67–79.
- A machine learning system for retaining patients in HIV care. arXiv preprint arXiv:2006.04944.
- A new Simplified Acute Physiology Score (SAPS II) based on a European/North American multicenter study. JAMA 270(24), pp. 2957–2963.
- NLTK: the Natural Language Toolkit. arXiv preprint cs/0205028.
- Social determinants of health inequalities. The Lancet 365(9464), pp. 1099–1104.
- The case for more active policy attention to health promotion. Health Affairs 21(2), pp. 78–93.
- Increased risk of common infections in patients with type 1 and type 2 diabetes mellitus. Clinical Infectious Diseases 41(3), pp. 281–288.
- ML4H auditing: from paper to practice. In Machine Learning for Health, pp. 280–317.
- Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464), pp. 447–453.
- Diabetes health disparities. Medical Care Research and Review 64(5_suppl), pp. 101S–156S.
- An empirical characterization of fair machine learning for clinical risk prediction. Journal of Biomedical Informatics 113, 103621.
- Ensuring fairness in machine learning to advance health equity. Annals of Internal Medicine 169(12), pp. 866–872.
- Gensim—statistical semantics in Python.
- Peeking into a black box, the fairness and generalizability of a MIMIC-III benchmarking model. Scientific Data 9(1), pp. 1–13.
- SHIFTing artificial intelligence to be responsible in healthcare: a systematic review. Social Science & Medicine, 114782.
- Negative patient descriptors: documenting racial bias in the electronic health record. Health Affairs, pp. 10–1377.
- WHO and ITU establish benchmarking process for artificial intelligence in health. The Lancet 394(10192), pp. 9–11.
- Hurtful words: quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 110–120.
- LOGAN: local group bias detection by clustering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1968–1977.