What Do You See in this Patient? Behavioral Testing of Clinical NLP Models

by Betty van Aken, et al.

Decision support systems based on clinical notes have the potential to improve patient care by pointing doctors towards overlooked risks. Predicting a patient's outcome is an essential part of such systems, for which the use of deep neural networks has shown promising results. However, the patterns learned by these networks are mostly opaque and previous work revealed flaws regarding the reproduction of unintended biases. We thus introduce an extendable testing framework that evaluates the behavior of clinical outcome models regarding changes of the input. The framework helps to understand learned patterns and their influence on model decisions. In this work, we apply it to analyse the change in behavior with regard to the patient characteristics gender, age and ethnicity. Our evaluation of three current clinical NLP models demonstrates the concrete effects of these characteristics on the models' decisions. The results show that model behavior varies drastically even when fine-tuned on the same data and that allegedly best-performing models have not always learned the most medically plausible patterns.




1 Introduction

Outcome prediction from clinical notes.

The use of automatic systems in the medical domain is promising due to their potential exposure to large amounts of data from earlier patients. This data can include information that helps doctors make better decisions regarding diagnoses and treatments of a patient at hand. Outcome prediction models take patient information as input and then output probabilities for all considered outcomes (Choi et al., 2018, Khadanga et al., 2019). We focus this work on outcome models using natural language in the form of clinical notes as an input, since they are a common source of patient information and contain a multitude of possible variables.

The problem of black box models and biases.

Recent models show promising results on tasks such as mortality prediction (Si and Roberts, 2019) and diagnosis prediction (Liu et al., 2018, Choi et al., 2018). However, since most of the proposed models work as black boxes we do not know which features they consider important for their decisions and how they interpret certain patient characteristics. From earlier work we also know that highly parameterized models are prone to emphasize biases in the data (Sun et al., 2019a). Such biases are known to be especially dangerous in the clinical domain (Straw, 2020). We further argue that they have high potential to disadvantage minority groups as their behavior towards out-of-distribution samples is often unpredictable. Thus, understanding models and their shortcomings is an essential prerequisite for their application in the clinical domain. We argue that more in-depth evaluations are needed to know whether such models have learned medically meaningful patterns or not.

Figure 1: Minimal alterations to the patient description can have a large impact on outcome predictions of clinical NLP models. We introduce behavioral testing for the clinical domain to analyse whether a model has learned useful or harmful patterns.

Behavioral testing for the clinical domain.

As a step towards this goal, we introduce a novel testing framework specifically for the clinical domain that enables us to examine the influence of certain patient characteristics on the model predictions. Our work is motivated by behavioral testing frameworks for general Natural Language Processing (NLP) tasks (Ribeiro et al., 2020) in which model behavior is observed under changing input data. Our framework incorporates a number of test cases and is further extendable to the needs of individual data sets and clinical tasks.

Influence of patient characteristics.

As an initial case study we apply the framework to analyse the behavior of models trained on the widely used MIMIC-III database (Johnson et al., 2016). We analyse how sensitive these models are towards textual indicators of protected characteristics in a clinical note, such as age, gender and ethnicity. On the one hand, these characteristics are known to be affected by discrimination and bias in health care (Stangl et al., 2019). On the other hand, they can represent important risk factors for certain diseases or conditions. That is why we consider it especially important to understand how these mentions affect model decisions.


In summary, we present the following contributions in this work:
1) We introduce a novel behavioral testing framework specifically for clinical NLP models. We release the code for applying and extending the framework (URL: https://github.com/bvanaken/clinical-behavioral-testing) to enable in-depth evaluations of clinical NLP models.
2) We present an analysis on the patient characteristics gender, age and ethnicity to understand the sensitivity of models towards textual cues regarding these groups and whether their predictions are medically plausible.
3) We show results of three state-of-the-art clinical NLP models and find that model behavior strongly varies depending on the applied pre-training. We further show that highly optimised models are often more prone to overestimate the effect of certain patient characteristics leading to potentially harmful behavior.

2 Related Work

Clinical Outcome Prediction.

Outcome prediction from clinical text has been studied regarding a variety of outcomes. The most prevalent are in-hospital mortality (Ghassemi et al., 2014, Jo et al., 2017, Suresh et al., 2018, Si and Roberts, 2019), diagnosis prediction (Tao et al., 2018, Liu et al., 2018, 2019a) and phenotyping (Liu et al., 2019b, Jain et al., 2019, Oleynik et al., 2019, Pfaff et al., 2020). In recent years, most approaches have been based on deep neural networks due to their ability to outperform earlier methods in most of the settings. Most recently, Transformer-based models have been applied for prediction of patient outcomes with reported increases in performance (Huang et al., 2019, Zhang et al., 2020a, Tuzhilin, 2020, Zhao et al., 2021, van Aken et al., 2021, Rasmy et al., 2021). In this work we analyse three of these Transformer-based models due to their growing prevalence in the application of NLP in health care.

2.1 Behavioral Testing in NLP

Ribeiro et al. (2020) identify shortcomings of common model evaluation on held-out datasets, such as the occurrence of the same biases in both training and test set and the lack of comprehensive testing scenarios in the held-out set. To mitigate these problems, they introduce CheckList, a behavioral testing framework to test general NLP abilities. In particular, they highlight that such frameworks evaluate input-output behavior without any knowledge of internal structures of a system (Beizer, 1995). Building upon CheckList, Röttger et al. (2021) introduce a behavioral testing suite for the domain of hate speech detection to address the individual challenges of the task. Following their work, we create a behavioral testing framework for the domain of clinical outcome prediction, which comprises idiosyncratic data and respective challenges.

2.2 Revealing Biases in Clinical NLP

The problem of biases in clinical NLP models is already highlighted by Zhang et al. (2020b). They quantify such biases by focusing on the recall gap among patient groups and by applying an artificial fill-in-the-gap task. They show that the models trained on data from MIMIC-III inherit biases regarding gender, language, ethnicity, and insurance status–often in favor of the majority group. We take these findings as motivation to directly analyse the sensitivity of such models with regard to patient characteristics. In contrast to their work and following Ribeiro et al. (2020), we want to eliminate the influence of biased test data on our evaluation. Further, our approach simulates patient cases that are similar to real-life occurrences. It thus displays the actual impact of learned biases on all analysed patient groups.

3 Behavioral Testing of Clinical NLP Models

Sample alterations.

Our goal is to examine how clinical NLP models react to mentions of certain patient characteristics in text. Comparable to earlier approaches to behavioral testing, we use sample alterations to artificially create different test groups. In our case, a test group is defined by one manifestation of a patient characteristic, such as female as the patient’s gender. In order to ensure that we only measure the influence of this one characteristic, we keep the rest of the patient case unchanged and apply the alterations to all samples in our test dataset. Depending on the original sample, the operations to create a certain test group thus include 1) changing a mention, 2) adding a mention or 3) keeping a mention unchanged (in case of a patient case that is already part of the test group at hand). This results in one newly created dataset per test group, all based on the same patient cases and only different in the patient characteristic under investigation.
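
The three operations can be sketched as follows. This is a minimal illustration and not the released implementation: the mention pattern `GENDER_PATTERN`, the helper names and the way a missing mention is added (prepending a short phrase) are assumptions made for this example, whereas the framework collects its mention variants from the training set.

```python
import re
from typing import Dict, List

# Hypothetical list of gender mentions, used only for illustration.
GENDER_PATTERN = re.compile(r"\b(female|male|woman|man|transgender)\b", re.IGNORECASE)


def to_test_group(note: str, target: str) -> str:
    """Return a copy of `note` whose first gender mention equals `target`."""
    match = GENDER_PATTERN.search(note)
    if match is None:
        # 2) adding a mention (assumed placement: start of the note)
        return f"{target.capitalize()} patient. {note}"
    if match.group(0).lower() == target:
        # 3) keeping a mention unchanged
        return note
    # 1) changing a mention; the rest of the note stays untouched
    return note[:match.start()] + target + note[match.end():]


def build_test_groups(notes: List[str], targets: List[str]) -> Dict[str, List[str]]:
    """Create one altered copy of the whole test set per test group."""
    return {t: [to_test_group(n, t) for n in notes] for t in targets}


notes = ["58 yo male with chest pain.", "Pt with chest pain."]
groups = build_test_groups(notes, ["female", "male", "transgender"])
```

Every test group contains the same patient cases in the same order, so predictions can later be compared case by case or on average.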

Figure 2: Behavioral testing framework for the clinical domain. Schematic overview of the introduced framework. From an existing test set we create test groups by altering specific tokens in the clinical note. We then analyse the change in predictions which reveals the impact of the mention on the clinical NLP model.

Prediction analysis.

After creating the test groups, we collect the models’ predictions for all cases in each test group. Unlike earlier approaches to behavioral testing, we do not test whether predictions on the altered samples are true or false with regard to the ground truth. As van Aken et al. (2021) pointed out, there is no real ground truth in clinical data, because the collected data shows only one possible pathway for a patient out of many. Further, existing biases in treatments and diagnoses are likely included in our testing data, potentially leading to meaningless results. To prevent that, we instead focus on detecting how the model outputs change regardless of the original annotations. This way we can also evaluate very rare mentions (e.g. transgender) and observe their impact on the model predictions reliably. Figure 2 shows a schematic overview of the functioning of the framework.
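
A minimal sketch of this label-free analysis is given below. It assumes a hypothetical `predict_proba` callable that maps a clinical note to a single outcome probability; the toy model and the two toy notes are invented for illustration and stand in for a fine-tuned clinical model and the altered test sets.

```python
from statistics import mean
from typing import Callable, Dict, List


def mean_prediction_per_group(
    groups: Dict[str, List[str]],
    predict_proba: Callable[[str], float],
) -> Dict[str, float]:
    """Average the predicted probability over all cases of each test group.

    No ground-truth labels are used: only the model outputs are compared.
    """
    return {
        name: mean(predict_proba(note) for note in notes)
        for name, notes in groups.items()
    }


def toy_model(note: str) -> float:
    """Stand-in for a real model; returns an invented probability."""
    return 0.2 if "transgender" in note else 0.3


toy_groups = {
    "male": ["58 yo male with chest pain.", "70 yo male with fever."],
    "transgender": ["58 yo transgender with chest pain.", "70 yo transgender with fever."],
}
averages = mean_prediction_per_group(toy_groups, toy_model)
```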


In this study, we use the introduced framework to analyse model behavior with regard to patient characteristics as described in 4.2. However, it can also be used to test more general model behavior such as the ability to identify negated symptoms or to detect specific diagnoses when certain indicators are present in the text. It is further possible to combine certain test groups e.g. to analyse how a model behaves on a combination of patient characteristics.

4 Experimental Setup

4.1 Data

We conduct our analysis on data from the MIMIC-III database (Johnson et al., 2016). In particular we use the outcome prediction task setup by van Aken et al. (2021). The classification task includes 48,745 admission notes annotated with the patients’ clinical outcomes at discharge. We select the outcomes diagnoses at discharge and in-hospital mortality for this analysis, since they have the highest impact on patient care and present a high potential to disadvantage certain patient groups. We use three models (see 4.3) trained on the two admission to discharge tasks and conduct our analysis on the test set defined by the authors with a total of 9,829 samples.

4.2 Considered Patient Characteristics

We choose three characteristics for the analysis in this work: age, gender and ethnicity. While these characteristics differ in their importance as clinical risk factors, all of them are known to be subject to biases and stigmas in health care (Stangl et al., 2019). Therefore, we want to test whether the analysed models have learned medically plausible patterns or ones that might be harmful to certain patient groups. We deliberately also include groups that occur very rarely in the original dataset. We want to understand the impact of imbalanced input data especially on minority groups, since they are already disadvantaged by the health care system (Riley, 2012, Bulatao and Anderson, 2004).

When altering the samples in our test set, we exploit the fact that patients are described in a mostly consistent way at the beginning of a clinical note. We collect all mention variations from the training set used to describe the different patient characteristics and alter the samples accordingly in an automated setup.


The age of a patient is a significant risk factor for a number of clinical outcomes. Our test includes all ages between 18 and 89 and the [** Age over 90**] de-identification label from the MIMIC-III database. van Aken et al. (2021) presented a comparable analysis on 20 random patient cases. We extend this analysis to all samples within a given test set for more reliable results. By analysing the model behavior on age mentions we can get insights on how the models interpret numbers, which is considered challenging for current NLP models (Wallace et al., 2019).
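
The age alteration can be sketched as a single substitution of the first age mention. The pattern below is an assumption for illustration: it only recognises a number or the over-90 label followed by "yo", "y/o" or "year", and it leaves notes without such a mention unchanged, while the framework uses the mention variants found in the training set.

```python
import re

# Assumed age mention: digits or the MIMIC-III over-90 label, followed by
# "yo", "y/o" or "year" (e.g. "58 yo", "58-year-old").
AGE_PATTERN = re.compile(
    r"(\[\*\*\s*Age over 90\s*\*\*\]|\b\d{1,3})(?=[\s-]*(?:yo|y/o|year))",
    re.IGNORECASE,
)

OVER_90_LABEL = "[**Age over 90 **]"

# One test group per age from 18 to 89, plus one group (90) for the label.
AGE_GROUPS = list(range(18, 90)) + [90]


def set_age(note: str, age: int) -> str:
    """Replace the first age mention in `note` with the simulated age."""
    mention = OVER_90_LABEL if age >= 90 else str(age)
    return AGE_PATTERN.sub(lambda _: mention, note, count=1)


altered = [set_age("58 yo male with chest pain", age) for age in AGE_GROUPS]
```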


A patient’s gender is both a risk factor for certain diseases and also subject to unintended biases in healthcare. We test the model’s behavior regarding gender by altering the gender mention and by changing all pronouns in the clinical note. In addition to female and male, we also consider transgender as a gender test group in our study. This group is extremely rare in clinical datasets like MIMIC-III, but since approximately 1.4 million people in the U.S. identify as transgender (Flores et al., 2016), it is important to understand how model predictions change when the characteristic is present in a clinical note.
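
Changing all pronouns can be sketched with a dictionary lookup. The mappings below are assumptions for illustration and cover only the female and male groups. Note that English "her" is ambiguous (it can correspond to "him" or "his"), so this simple sketch always maps it to "his".

```python
import re
from typing import Dict

# Hypothetical pronoun mappings, not taken from the released code.
TO_FEMALE = {"he": "she", "him": "her", "his": "her", "himself": "herself"}
TO_MALE = {"she": "he", "her": "his", "hers": "his", "herself": "himself"}


def swap_pronouns(note: str, mapping: Dict[str, str]) -> str:
    """Replace every whole-word pronoun in `note` according to `mapping`."""
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, mapping)) + r")\b", re.IGNORECASE
    )

    def replace(match: "re.Match[str]") -> str:
        word = match.group(0)
        new = mapping[word.lower()]
        # Preserve sentence-initial capitalisation.
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(replace, note)


example = swap_pronouns("He reports pain in his chest.", TO_FEMALE)
```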


The ethnicity of a patient is only occasionally mentioned in clinical notes and its role in medical decision-making is controversial, since it can lead to disadvantages in patient care (Anderson et al., 2001, Snipes et al., 2011). Earlier studies have also shown that ethnicity in clinical notes is often incorrectly assigned (Moscou et al., 2003). We want to know how clinical NLP models interpret the mention of ethnicity in a clinical note and whether their behavior can cause unfair treatment. We choose White, African American, Hispanic and Asian as ethnicity groups for our evaluation, as they are the most frequent ethnicities in MIMIC-III.

            PubMedBERT   CORe    BioBERT
Diagnoses   83.75        83.54   82.81
Mortality   84.28        84.04   82.55
Table 1: Performance of three state-of-the-art models on the outcome prediction tasks diagnoses (multi-label) and mortality prediction (binary task) in % AUROC. PubMedBERT outperforms the other two models in both tasks by a small margin.

4.3 Clinical NLP Models

In this study, we apply the introduced testing framework to three existing clinical models which are fine-tuned on the tasks of diagnosis and mortality prediction. We use the model checkpoints of van Aken et al. (2021) and additionally fine-tune the PubMedBERT model (Gu et al., 2020) on the same training data with the same hyperparameter setup (batch size: 20; learning rate: 5e-05; dropout: 0.1; warmup steps: 1000; early stopping patience: 20). The models are based on the BERT architecture (Devlin et al., 2019) as it presents the current state-of-the-art in predicting patient outcomes. Their performance on the two tasks is shown in Table 1. We deliberately choose three models based on the same architecture to investigate the impact of pre-training data while keeping architectural considerations aside. In general the proposed testing framework is model agnostic and works with any type of text-based outcome prediction model.


Lee et al. (2020) introduced BioBERT which is based on a pre-trained BERT Base (Devlin et al., 2019) checkpoint. They applied another language model fine-tuning step using biomedical articles from PubMed abstracts and full-text articles. BioBERT has shown improved performance on both medical and clinical downstream tasks.


Clinical Outcome Representations (CORe) by van Aken et al. (2021) are based on BioBERT and extended with a pre-training step that focuses on the prediction of patient outcomes. The pre-training data includes clinical notes, Wikipedia articles and case studies from PubMed.


Gu et al. (2020) recently introduced PubMedBERT based on similar data as BioBERT. They use PubMed articles and abstracts but instead of extending a BERT Base model, they train PubMedBERT from scratch. The model reaches state-of-the-art results on multiple medical NLP tasks and outperforms the other analysed models on the outcome prediction tasks.

5 Results

We present the results on all test cases by averaging the probabilities that a model assigns to each test sample. We then compare the averaged probabilities across test cases to identify which characteristics have a large impact on the model’s prediction over the whole test set. The values per diagnosis in the heatmaps shown in Figures 3, 4, 7 and 8 are defined using the following formula:

$$c_i = p_i - \frac{1}{N} \sum_{j \neq i} p_j$$

where $c_i$ is the value assigned to test group $i$, $p$ is the (predicted) probability for a given diagnosis and $N$ is the number of all test groups except $i$.

We choose this illustration to highlight both positive and negative influence of a characteristic on model behavior. Since all test groups are based on the same patients and only differ regarding the characteristic at hand, even small differences in the averaged predictions can point towards general patterns that the model learned to associate with a characteristic.
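
The heatmap value, i.e. the deviation of one test group's averaged probability from the average of all other test groups, can be computed as in the following sketch. The function name is an assumption. The input values are the first column of Table 2 and serve only as a worked example, since they are mortality rather than diagnosis probabilities.

```python
from typing import Dict


def deviation_from_others(mean_probs: Dict[str, float]) -> Dict[str, float]:
    """For each test group, subtract the mean probability of all other groups."""
    deviations = {}
    for group, prob in mean_probs.items():
        others = [p for g, p in mean_probs.items() if g != group]
        deviations[group] = prob - sum(others) / len(others)
    return deviations


# Averaged predictions per gender test group (first column of Table 2).
table2_first_column = {"Female": 0.335, "Male": 0.333, "Transgender": 0.326}
dev = deviation_from_others(table2_first_column)
# Transgender: 0.326 - (0.335 + 0.333) / 2, i.e. roughly -0.008
```

A negative value corresponds to a below-average probability for that test group and a positive value to an above-average one.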

5.1 Influence of Gender

Transgender mention leads to lower mortality and diagnoses predictions.

Table 2 shows the mortality predictions of the three analysed models with regard to the gender assigned in the text. While the predicted mortality risks for female and male patients lie within a small range, all models predict a lower mortality risk for patients described as transgender than for non-transgender patients. This is probably due to the relatively young age of most transgender patients in the MIMIC-III training data, but it can be harmful to older patients identifying as transgender at inference time.

Figure 3: Influence of gender on predicted diagnoses. Blue: Predicted probability for diagnosis is below-average; red: predicted probability above-average. PubMedBERT shows highest sensitivity to gender mention and regards many diagnoses less likely if transgender is mentioned in the text. Graph shows deviation of probabilities on 24 most common diagnoses in test set.
Figure 4: Original distribution of diagnoses per gender in MIMIC-III. Cell colors: Deviation from average probability. Numbers in parenthesis: Occurrences in the training set. Most diagnoses occur less often in transgender patients due to their very low sample count.

Sensitivity to gender mention varies across models.

Figure 3 shows the change in model prediction for each diagnosis with regard to the gender mention. The cells of the heatmap are the deviations from the average score of the other test cases. Thus, a light cell indicates that the model assigns a higher probability to a diagnosis for this gender group. We see that PubMedBERT is highly sensitive to the change of the patient gender, especially regarding transgender patients. Except for a few diagnoses such as Cardiac dysrhythmias and Drug Use / Abuse, the model assigns a lower probability to diseases if the patient letter contains the transgender mention. The CORe and BioBERT models are less sensitive in this regard. The most salient deviation of the BioBERT model is a drop in probability of Urinary tract disorders for male patients, which is medically plausible due to anatomic differences (Tan and Chlebicki, 2016).

Biases in MIMIC-III training data are partially inherited.

In Figure 4 we show the original distribution of diagnoses per gender in the training data. Note that the deviations are about 10 times larger than the ones produced by the model predictions in Figure 3. This indicates that the models take gender as a decision factor, but only among others. Due to the very rare occurrence of transgender mentions (only seven cases in the training data), most diagnoses are underrepresented for this group. This is partially reflected by the model predictions, especially by PubMedBERT, as described above. Other salient patterns such as the prevalence of Chronic ischemic heart disease in male patients are only reproduced faintly by the models.

              PubMedBERT   CORe    BioBERT
Female        0.335        0.239   0.119
Male          0.333        0.245   0.121
Transgender   0.326        0.229   0.117
Table 2: Influence of gender on mortality predictions. PubMedBERT assigns highest risk to female, the other models to male patients. Notably, all models decrease their mortality prediction for transgender patients.

5.2 Influence of Age

Figure 5: Influence of age on mortality predictions. X-axis: Simulated age; y-axis: predicted mortality risk. The three models are differently calibrated and only CORe is highly influenced by age.

Mortality risk predictions are differently influenced by age.

Figure 5 shows the averaged predicted mortality per age for all models and the actual distribution from the training data (dotted line). We can see that BioBERT does not take age into account when predicting mortality risk except for patients over 90 (which are described by the tokens [**Age over 90 **] in MIMIC-III). The PubMedBERT model assigns a higher mortality risk to all age groups with a small increase for patients over 60 and an even steeper increase for patients over 90. The CORe model follows the training data most closely and also inherits many of its peaks and troughs.

Models are equally affected by age when predicting diagnoses.

We exemplify the impact of age on diagnosis prediction on eight outcome diagnoses in Figure 6. The dotted lines show the distribution of the diagnosis within an age group in the training data. The changes in predictions with age are similar throughout the analysed models, with only small variations such as for Cardiac dysrhythmias. Some diagnoses are regarded as more probable in older patients (e.g. Acute Kidney Failure) and others in younger patients (e.g. Abuse of drugs). The distributions per age group in the training data are more extreme, but follow the same tendencies as predicted by the models.

Figure 6: Influence of age on diagnosis predictions. The x-axis is the simulated age and the y-axis is the predicted probability of a diagnosis. All models follow similar patterns with some diagnosis risks increasing with age and some decreasing. The original training distributions (black dotted line) are mostly followed but attenuated.

Prediction peaks indicate lack of number understanding.

From earlier studies we know that BERT-based models have difficulties dealing with numbers in text (Wallace et al., 2019). The peaks that we observe in some predictions support this finding. For instance, the models assign a higher risk of Cardiac dysrhythmias to patients aged 73 than to patients aged 74, because they do not capture that these are consecutive ages. Therefore, the influence of age on the predictions is solely based on the individual age tokens observed in the training data.
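
Such peaks can be located by comparing the averaged predictions of consecutive ages. The sketch below is an assumption about how one might do this and is not part of the described framework: the threshold is arbitrary and the probabilities are invented toy values, not results from the paper.

```python
from typing import Dict, List, Tuple


def consecutive_jumps(
    pred_by_age: Dict[int, float], threshold: float
) -> List[Tuple[int, int]]:
    """Return pairs of consecutive ages whose predictions differ by more than `threshold`."""
    ages = sorted(pred_by_age)
    jumps = []
    for younger, older in zip(ages, ages[1:]):
        is_consecutive = older - younger == 1
        difference = abs(pred_by_age[older] - pred_by_age[younger])
        if is_consecutive and difference > threshold:
            jumps.append((younger, older))
    return jumps


# Invented toy values with an isolated peak at age 73.
toy_predictions = {72: 0.10, 73: 0.16, 74: 0.10, 75: 0.11}
peaks = consecutive_jumps(toy_predictions, threshold=0.03)
```

A model that captured the ordering of ages would be expected to produce few such isolated jumps, because neighbouring ages carry nearly identical clinical risk.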

5.3 Influence of Ethnicity

Mention of any ethnicity decreases prediction of mortality risk.

Table 3 shows the mortality predictions when different ethnicities are mentioned and when there is no mention. We observe that the mention of any of the ethnicities leads to a decrease in mortality risk prediction in all models, with White and African American patients receiving the lowest probabilities.

Diagnoses predicted by PubMedBERT are highly sensitive to ethnicity mentions.

Figure 7 depicts the influence of ethnicity mentions on the three models. Notably, the predictions of PubMedBERT are strongly influenced by ethnicity mentions. Multiple diagnoses such as Chronic kidney disease are more often predicted when there is no mention of ethnicity, while diagnoses like Hypertension and Abuse of drugs are regarded more likely in African American patients and Unspecified anemias in Hispanic patients. While the original training data in Figure 8 shows the same strong variance among ethnicities, this bias is not inherited the same way in the CORe and BioBERT models. However, we can also observe deviations regarding ethnicity in these models.

                PubMedBERT   CORe    BioBERT
No mention      0.333        0.243   0.120
White           0.329        0.235   0.119
African Amer.   0.329        0.239   0.116
Hispanic        0.331        0.237   0.118
Asian           0.330        0.238   0.118
Table 3: Influence of ethnicity on mortality predictions. The mention of an ethnicity decreases the predicted mortality risk. White and African American patients are assigned the lowest mortality risk (gray-shaded).
Figure 7: Influence of ethnicity on diagnosis predictions. Blue: Predicted probability for diagnosis is below-average; red: predicted probability above-average. PubMedBERT’s predictions are highly influenced by ethnicity mentions, while CORe and BioBERT show smaller deviations, but also disparities on specific groups.
Figure 8: Original distribution of diagnoses per ethnicity in MIMIC-III. Cell colors: Deviation from average probability. Numbers in parenthesis: Occurrences in the training set. Both the distribution of samples and the occurrences of diagnoses are highly unbalanced in the training set. Some patterns are inherited by the fine-tuned models, while others are not.

African American patients are assigned lower risk of diagnoses by CORe and BioBERT.

The heatmaps showing predictions of CORe and BioBERT reveal a potentially harmful pattern in which the mention of African American in a clinical note decreases the predictions for a large number of diagnoses. This pattern is found more prominently in the CORe model, but also in BioBERT. This behavior can lead to disadvantages in the treatment of African American patients and would reinforce existing biases in health care (Nelson, 2002).

6 Discussion

Sensitivity and impact of characteristics show large variance.

The results described in Section 5 reveal large differences in the influence of patient characteristics throughout models. The analysis shows that there is no overall best model, but each model has learned both useful patterns (e.g. age as a medically plausible risk factor) and potentially dangerous ones (e.g. decreases in diagnosis risks for minority groups). The large variance is surprising since the models have a shared architecture and are fine-tuned on the same data; they differ only in their pre-training. And while the reported AUROC scores for the models (Table 1) are close to each other, the variance in learned behavior shows that we should consider in-depth analyses a crucial part of model evaluation in the clinical domain. This is especially important since unintended biases in clinical NLP models are often fine-grained and difficult to detect.

Best performing model is especially sensitive to gender and ethnicity mentions.

The analysis has shown that PubMedBERT, which outperforms the other models in both mortality and diagnosis prediction, shows larger sensitivity to mentions of gender and ethnicity in the text. This is alarming since it particularly affects minority groups which are already disadvantaged by the health care system. It also shows that instead of measuring clinical models regarding a single score, looking at their robustness and potential impact should be further emphasized.

De-biasing methods need to be aligned with medical knowledge.

The application of de-biasing approaches has been shown to be effective in general language scenarios in the past (Sun et al., 2019b). While their evaluation is out of the scope of this work, we want to highlight that their application in clinical outcome prediction can be challenging. We argue that de-biasing methods cannot be applied to patient characteristics in clinical text in the same way as for general language. The decision about which characteristics should be considered a risk factor and their impact on outcome predictions should be aligned with medical knowledge. Therefore, we focus follow-up research on iterative model learning using feedback loops with medical professionals to define favorable patterns and adverse ones.

7 Conclusion

In this work, we introduced a novel behavioral testing framework for the clinical domain that enables us to understand the effects of textual variations on the model’s prediction. We apply this framework to examine the impact of certain patient characteristics, and evaluate whether current NLP models reproduce dangerous biases in health care. Our results show that the models have indeed learned to overestimate certain characteristics, especially those of minority groups, which potentially leads to disadvantages. With this work we want to emphasize the importance of model evaluation beyond common metrics, especially in sensitive areas like health care. For future research we propose additional behavioral analyses, e.g. regarding stigmatizing language in clinical notes as defined by Goddu et al. (2018). We also propose to apply the framework to evaluate different de-biasing approaches and to further develop approaches for removing harmful biases while keeping plausible patterns regarding clinical risk factors intact.


Our work is funded by the German Federal Ministry for Economic Affairs and Energy (BMWi) under grant agreement 01MD19003B (PLASS) and 01MK2008MD (Servicemeister).


  • M. Anderson, S. Moscou, C. Fulchon, and D. Neuspiel (2001) The role of race in the clinical presentation. Family medicine 33, pp. 430–4. Cited by: §4.2.
  • B. Beizer (1995) Black-box testing: techniques for functional testing of software and systems. John Wiley & Sons, Inc.. Cited by: §2.1.
  • R. A. Bulatao and N. B. Anderson (2004) Understanding racial and ethnic differences in health in late life: a research agenda. National Academies Press (US). Cited by: §4.2.
  • E. Choi, C. Xiao, W. F. Stewart, and J. Sun (2018) MiME: multilevel medical embedding of electronic health records for predictive healthcare. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 4552–4562. Cited by: §1, §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 4171–4186. Cited by: §4.3, §4.3.
  • A.R. Flores, J.L. Herman, G.J. Gates, and T.N.T. Brown (2016) How many adults identify as transgender in the united states?. Los Angeles, CA: The Williams Institute. Cited by: §4.2.
  • M. Ghassemi, T. Naumann, F. Doshi-Velez, N. Brimmer, R. Joshi, A. Rumshisky, and P. Szolovits (2014) Unfolding physiological state: mortality modelling in intensive care units. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA - August 24 - 27, 2014, S. A. Macskassy, C. Perlich, J. Leskovec, W. Wang, and R. Ghani (Eds.), pp. 75–84. Cited by: §2.
  • A. Goddu, K. O’Conor, S. Lanzkron, M. Saheed, S. Saha, C. Haywood, and M. C. Beach (2018) Do words matter? stigmatizing language and the transmission of bias in the medical record. Journal of General Internal Medicine 33. Cited by: §7.
  • Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2020) Domain-specific language model pretraining for biomedical natural language processing. Cited by: §4.3, §4.3.
  • K. Huang, J. Altosaar, and R. Ranganath (2019) Clinicalbert: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342. Cited by: §2.
  • S. Jain, R. Mohammadi, and B. C. Wallace (2019) An analysis of attention over clinical notes for predictive tasks. arXiv preprint arXiv:1904.03244. Cited by: §2.
  • Y. Jo, L. Lee, and S. Palaskar (2017) Combining lstm and latent topic modeling for mortality prediction. arXiv preprint arXiv:1709.02842. Cited by: §2.
  • A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-Wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a Freely Accessible Critical Care Database. Scientific Data 3 (1), pp. 1–9. Cited by: §1, §4.1.
  • S. Khadanga, K. Aggarwal, S. R. Joty, and J. Srivastava (2019) Using clinical notes with time series data for ICU management. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 6431–6436. Cited by: §1.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §4.3.
  • D. Liu, D. Dligach, and T. A. Miller (2019a) Two-stage federated phenotyping and patient representation learning. In Proceedings of the 18th BioNLP Workshop and Shared Task, BioNLP@ACL 2019, Florence, Italy, August 1, 2019, D. Demner-Fushman, K. B. Cohen, S. Ananiadou, and J. Tsujii (Eds.), pp. 283–291. Cited by: §2.
  • D. Liu, D. Dligach, and T. Miller (2019b) Two-stage federated phenotyping and patient representation learning. In Proceedings of the conference. Association for Computational Linguistics. Meeting, Vol. 2019, pp. 283. Cited by: §2.
  • J. Liu, Z. Zhang, and N. Razavian (2018) Deep EHR: chronic disease prediction using medical notes. In Proceedings of the Machine Learning for Healthcare Conference, MLHC 2018, 17-18 August 2018, Palo Alto, California, F. Doshi-Velez, J. Fackler, K. Jung, D. C. Kale, R. Ranganath, B. C. Wallace, and J. Wiens (Eds.), Proceedings of Machine Learning Research, Vol. 85, pp. 440–464. Cited by: §1, §2.
  • S. Moscou, M. R. Anderson, J. B. Kaplan, and L. Valencia (2003) Validity of racial/ethnic classifications in medical records data: an exploratory study. American journal of public health 93 (7), pp. 1084–1086. Cited by: §4.2.
  • A. Nelson (2002) Unequal treatment: confronting racial and ethnic disparities in health care. Journal of the national medical association 94 (8), pp. 666. Cited by: §5.3.
  • M. Oleynik, A. Kugic, Z. Kasáč, and M. Kreuzthaler (2019) Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification. Journal of the American Medical Informatics Association 26 (11), pp. 1247–1254. Cited by: §2.
  • E. R. Pfaff, M. Crosskey, K. Morton, and A. Krishnamurthy (2020) Clinical annotation research kit (clark): computable phenotyping using machine learning. JMIR medical informatics 8 (1), pp. e16042. Cited by: §2.
  • L. Rasmy, Y. Xiang, Z. Xie, C. Tao, and D. Zhi (2021) Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine 4 (1), pp. 1–13. Cited by: §2.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of NLP models with checklist. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault (Eds.), pp. 4902–4912. Cited by: §1, §2.1, §2.2.
  • W. J. Riley (2012) Health disparities: gaps in access, quality and affordability of medical care. Transactions of the American Clinical and Climatological Association 123, pp. 167. Cited by: §4.2.
  • P. Röttger, B. Vidgen, D. Nguyen, Z. Waseem, H. Z. Margetts, and J. B. Pierrehumbert (2021) HateCheck: functional tests for hate speech detection models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, C. Zong, F. Xia, W. Li, and R. Navigli (Eds.), pp. 41–58. Cited by: §2.1.
  • Y. Si and K. Roberts (2019) Deep patient representation of clinical notes via multi-task learning for mortality prediction. AMIA Joint Summits on Translational Science proceedings. AMIA Joint Summits on Translational Science 2019, pp. 779–788. Cited by: §1, §2.
  • S. Snipes, S. Sellers, A. Tafawa, L. Cooper, J. Fields, and V. Bonham (2011) Is race medically relevant? a qualitative study of physicians’ attitudes about the role of race in treatment decision-making. BMC health services research 11, pp. 183. Cited by: §4.2.
  • A. L. Stangl, V. A. Earnshaw, C. H. Logie, W. van Brakel, L. C. Simbayi, I. Barré, and J. F. Dovidio (2019) The health stigma and discrimination framework: a global, crosscutting framework to inform research, intervention development, and policy on health-related stigmas. BMC medicine 17 (1), pp. 1–13. Cited by: §1, §4.2.
  • I. Straw (2020) The automation of bias in medical artificial intelligence (AI): decoding the past to create a better future. Artif. Intell. Medicine 110, pp. 101965. Cited by: §1.
  • T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. M. Belding, K. Chang, and W. Y. Wang (2019a) Mitigating gender bias in natural language processing: literature review. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 1630–1640. Cited by: §1.
  • T. Sun, A. Gaut, S. Tang, Y. Huang, M. ElSherief, J. Zhao, D. Mirza, E. M. Belding, K. Chang, and W. Y. Wang (2019b) Mitigating gender bias in natural language processing: literature review. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, A. Korhonen, D. R. Traum, and L. Màrquez (Eds.), pp. 1630–1640. Cited by: §6.
  • H. Suresh, J. J. Gong, and J. V. Guttag (2018) Learning tasks for multitask learning: heterogenous patient populations in the ICU. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, UK, August 19-23, 2018, Y. Guo and F. Farooq (Eds.), pp. 802–810. Cited by: §2.
  • C. Tan and M. Chlebicki (2016) Urinary tract infections in adults. Singapore Medical Journal 57, pp. 485–490. Cited by: §5.1.
  • Y. Tao, B. Godefroy, G. Genthial, and C. Potts (2018) Effective feature representation for clinical text concept extraction. arXiv preprint arXiv:1811.00070. Cited by: §2.
  • A. Tuzhilin (2020) Predicting clinical diagnosis from patients electronic health records using bert-based neural networks. In Artificial Intelligence in Medicine: 18th International Conference on Artificial Intelligence in Medicine, AIME 2020, Minneapolis, MN, USA, August 25-28, 2020, Proceedings, Vol. 12299, pp. 111. Cited by: §2.
  • B. van Aken, J. Papaioannou, M. Mayrdorfer, K. Budde, F. A. Gers, and A. Löser (2021) Clinical outcome prediction from admission notes using self-supervised knowledge integration. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, April 19 - 23, 2021, P. Merlo, J. Tiedemann, and R. Tsarfaty (Eds.), pp. 881–893. Cited by: §2, §3, §4.1, §4.2, §4.3, §4.3.
  • E. Wallace, Y. Wang, S. Li, S. Singh, and M. Gardner (2019) Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), pp. 5306–5314. Cited by: §4.2, §5.2.
  • D. Zhang, J. Thadajarassiri, C. Sen, and E. Rundensteiner (2020a) Time-aware transformer-based network for clinical notes series prediction. In Machine Learning for Healthcare Conference, pp. 566–588. Cited by: §2.
  • H. Zhang, A. X. Lu, M. Abdalla, M. McDermott, and M. Ghassemi (2020b) Hurtful words: quantifying biases in clinical contextual word embeddings. In Proceedings of the ACM Conference on Health, Inference, and Learning, pp. 110–120. Cited by: §2.2.
  • Y. Zhao, Q. Hong, X. Zhang, Y. Deng, Y. Wang, and L. Petzold (2021) BERTSurv: bert-based survival models for predicting outcomes of trauma patients. arXiv preprint arXiv:2103.10928. Cited by: §2.