Creating Fair Models of Atherosclerotic Cardiovascular Disease Risk

09/12/2018 ∙ by Stephen Pfohl, et al.

Guidelines for the management of atherosclerotic cardiovascular disease (ASCVD) recommend the use of risk stratification models to identify patients most likely to benefit from cholesterol-lowering and other therapies. These models have differential performance across race and gender groups with inconsistent behavior across studies, potentially resulting in an inequitable distribution of beneficial therapy. In this work, we leverage adversarial learning and a large observational cohort extracted from electronic health records (EHRs) to develop a "fair" ASCVD risk prediction model with reduced variability in error rates across groups. We empirically demonstrate that our approach is capable of aligning the distribution of risk predictions conditioned on the outcome across several groups simultaneously for models built from high-dimensional EHR data. We also discuss the relevance of these results in the context of the empirical trade-off between fairness and model performance.

Introduction

Atherosclerotic cardiovascular disease (ASCVD), which includes heart attack, stroke, and fatal coronary heart disease, is a major cause of mortality and morbidity worldwide, as well as in the U.S., where it contributes to 1 in 3 of all deaths, many of which are preventable [Benjamin et al.2018]. In deciding whether to prescribe cholesterol-lowering therapies to prevent ASCVD, physicians are often guided by risk estimates yielded by the Pooled Cohort Equations (PCEs). PCEs provide a proportional hazards model [Stone et al.2014, Goff et al.2013] that leverages nine clinical measurements to predict the 10-year risk of a first ASCVD event. However, this model has been found to overestimate risk for female patients [Mora et al.2018], Chinese patients [De Filippis et al.2017], or globally [Yadlowsky et al.2018, Pylypchuk et al.2018], and to underestimate risk for other groups such as Korean women [Jung et al.2015]. Such mis-estimation results in an inequitable distribution of the benefits and harms of ASCVD risk scoring, because incorrect risk estimates can expose patients to substantial harm through under- or over-treatment, potentially leading to preventable cardiovascular events or side effects from unnecessary therapy, respectively.

The inability of the PCEs to generalize to diverse cohorts is likely due to both under-representation of minority populations in the cohorts used to develop the PCEs and shifts in medical practice and lifestyle patterns in the decades since data collection for those cohorts. In attempting to correct for these patterns, one recent study [Yadlowsky et al.2018] updated the PCEs using data from contemporary cohorts and demonstrated that doing so reduced the number of minority patients misclassified as being high or low risk. Similar results were observed in the same study with an approach using an elastic net classifier, rather than a proportional hazards model. However, neither approach is able to explicitly guarantee an equitable distribution of mis-estimation across relevant subgroups, particularly for race- and gender-based subgroups.

To account for under-represented minorities and to take advantage of the wider variety of variables made available in electronic health records (EHRs), we derive a large and diverse modern cohort from EHRs to learn a prediction model for ASCVD risk. Furthermore, we investigate the extent to which we can encode algorithmic notions of fairness, specifically equality of odds [Hardt, Price, and Srebro2016], into the model to encourage an equitable distribution of utility across subpopulations. To the best of our knowledge, our effort is the first to explore the extent to which this formal fairness metric is achievable for risk prediction models built using high-dimensional data from the EHR. We show that it is feasible to develop models that achieve equality of odds, but emphasize that this process involves trade-offs that must be assessed in a broader social and medical context [Verghese, Shah, and Harrington2018].

Background and Related Work

ASCVD Risk Prediction and EHRs

The PCEs are based on age, gender, cholesterol levels, blood pressure, and smoking and diabetes status and were developed by pooling data from five large U.S. cohorts [Stone et al.2014] composed of white and black patients, with white patients constituting a majority. Recently, attempts [Yadlowsky et al.2018] were made to update the PCEs to improve model performance for race- and gender-based subgroups using elastic net regression and data from modern prospective cohorts. However, this effort focused on demographic groups and variables already used to develop the PCEs and did not consider other populations or clinical measurements. The increasing adoption of EHRs offers opportunities to deploy and refine ASCVD risk models. Efforts have recently been undertaken to apply existing models, including the PCEs, QRISKII, and the Framingham score, to large EHR-derived cohorts and characterize their performance in certain subgroups [Asaria et al.2016, Pike et al.2016, Rana et al.2016], or to develop new models using EHRs directly [Rapsomaniki et al.2014]. Beyond ASCVD risk prediction, there exist many recent works that develop prediction models with EHRs, which are reviewed in [Goldstein et al.2017] and [Xiao, Choi, and Sun2018].

Fair Risk Prediction

We consider the case where supervised learning is used to estimate a function $f(X)$ that approximates the conditional distribution $p(Y \mid X)$, given samples drawn from the distribution $p(X, Y, Z)$. We take $X$ to correspond to a vector representation of the medical history extracted from the EHR prior to a patient-specific index time $t_i$; $Y$ to be a binary label, which for patient $i$ indicates the presence of the outcome observed in the EHR in the time frame $[t_i, t_i + w_i]$, where $w_i$ is a parameter specifying the amount of time following the index time used to derive the outcome; and $Z$ to indicate a sensitive attribute, such as race, gender, or age, with $k$ groups. The output of the learned function $f(X)$ is then thresholded with respect to a value $T$ to yield a prediction $\hat{Y}$.

One standard metric for assessing the fairness of a classifier with respect to a sensitive attribute is demographic parity [Dwork et al.2012, Zemel et al.2013], which evaluates the independence between the sensitive attribute $Z$ and the prediction $\hat{Y}$. However, optimizing for demographic parity is often of limited use for clinical risk prediction, because doing so may preclude the model from considering relevant clinical features associated with the sensitive attribute, thus decreasing performance and utility of the model for all groups [Zafar et al.2017].
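As a small illustration of what demographic parity measures, the sketch below compares the fraction of positive predictions across groups. The function names and the toy data are hypothetical, and the max-minus-min gap is just one convenient summary, not a quantity reported in this work.

```python
def positive_rate_by_group(predictions, groups):
    """Fraction of positive (1) predictions within each group."""
    totals, positives = {}, {}
    for pred, z in zip(predictions, groups):
        totals[z] = totals.get(z, 0) + 1
        positives[z] = positives.get(z, 0) + pred
    return {z: positives[z] / totals[z] for z in totals}


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups.

    A gap of 0 means the prediction rate is identical in every group.
    """
    rates = positive_rate_by_group(predictions, groups).values()
    return max(rates) - min(rates)


# Hypothetical toy example: two groups, four patients.
toy_predictions = [1, 0, 1, 1]
toy_groups = ["a", "a", "b", "b"]
gap = demographic_parity_gap(toy_predictions, toy_groups)
```

Note that this quantity ignores the true label entirely, which is why a model can satisfy it while performing poorly for every group.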

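Another related metric is equality of odds [Hardt, Price, and Srebro2016], which stipulates that the prediction $\hat{Y}$ should be conditionally independent of the sensitive attribute $Z$, given the true label $Y$. Formally, satisfying equality of odds implies that

$$P(\hat{Y} = \hat{y} \mid Z = z_i, Y = y) = P(\hat{Y} = \hat{y} \mid Z = z_j, Y = y) \quad \text{for all groups } z_i, z_j \text{ and } y \in \{0, 1\} \tag{1}$$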

From this, it can be seen that, if equality of odds is achieved, then for a fixed threshold $T$, both the false positive rate (FPR) and the false negative rate (FNR) are equal across all pairs of groups defined by $Z$. Furthermore, this definition can be extended to the case of a continuous risk score $f(X)$ by requiring that

$$P(f(X) \mid Z = z_i, Y = y) = P(f(X) \mid Z = z_j, Y = y) \quad \text{for all groups } z_i, z_j \text{ and } y \in \{0, 1\} \tag{2}$$

In this case, the distribution of the predicted probability of the outcome, conditioned on whether the event occurred or not, should be matched across groups of a sensitive variable. Formulation (2) is stronger than (1) since it implies that equality of odds is achieved for all possible thresholds. This is desirable since it provides the end-user the ability to freely adjust the decision threshold of the model without violating equality of odds.
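To make the fixed-threshold reading of equality of odds concrete, the following minimal sketch computes the FPR and FNR within each group and lets them be compared. The function name, the toy scores, and the choice to treat a score equal to the threshold as a positive prediction are assumptions made for illustration only.

```python
from collections import defaultdict


def group_error_rates(scores, labels, groups, threshold=0.075):
    """Return {group: (FPR, FNR)} for risk scores thresholded at `threshold`.

    Assumption: a score >= threshold counts as a positive prediction.
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for score, y, z in zip(scores, labels, groups):
        predicted_positive = score >= threshold
        if y == 1:
            counts[z]["tp" if predicted_positive else "fn"] += 1
        else:
            counts[z]["fp" if predicted_positive else "tn"] += 1

    rates = {}
    for z, c in counts.items():
        negatives = c["fp"] + c["tn"]
        positives = c["fn"] + c["tp"]
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[z] = (fpr, fnr)
    return rates


# Hypothetical toy data: two groups of four patients each.
toy_scores = [0.02, 0.10, 0.20, 0.05, 0.03, 0.09, 0.30, 0.01]
toy_labels = [0, 0, 1, 1, 0, 0, 1, 1]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
toy_rates = group_error_rates(toy_scores, toy_labels, toy_groups)
```

In this toy example both groups have the same FPR and FNR at the chosen threshold, so the fixed-threshold condition holds there; the continuous-score condition would additionally require it to hold at every other threshold.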

Finally, we also note that satisfying equality of odds for a continuous risk score may be reduced to the problem of minimizing a distance metric over each pair of distributions referenced in equation (2). Adversarial learning procedures [Goodfellow et al.2014] are well-suited to this problem in that they provide a flexible framework for minimizing the divergence over distributions parameterized by neural networks [Uehara et al.2016, Mohamed and Lakshminarayanan2016]. As such, several related works [Zhang, Lemoine, and Mitchell2018, Beutel et al.2017, Edwards and Storkey2015, Madras et al.2018] have demonstrated the benefit of augmenting a classifier with an adversary in order to align the distribution of predictions for satisfying fairness constraints.

Approaches for Achieving Fairness

Despite considerable interest in the ethical implications of implementing machine learning in healthcare [Char, Shah, and Magnus2018, Cohen et al.2014], relatively little work exists characterizing the extent to which risk prediction models developed with EHR data satisfy formal fairness constraints.

Adversarial approaches for satisfying fairness constraints (in the form of demographic parity) have been explored in several recent works in non-healthcare domains. One approach [Edwards and Storkey2015], in the context of image anonymization, demonstrated that representations satisfying demographic parity could be learned by augmenting a predictive model with both an autoencoder and an adversarial component. The adversarial approach to fairness was further investigated by [Beutel et al.2017] with a gradient reversal objective for data that is imbalanced in the distribution of both the outcome and the sensitive attribute.

In attempting to address the limitations of demographic parity as a metric, [Hardt, Price, and Srebro2016] introduced equality of odds as an alternative and devised post-processing methods to achieve it for fixed-threshold classifiers. Recently, [Zhang, Lemoine, and Mitchell2018] and [Madras et al.2018] generalized the adversarial framework to achieve equality of odds by providing the adversary access to the value of the outcome. Such approaches are methodologically similar to those employed in domain-adversarial training [Ganin et al.2016] and may be seen as a special case of the more general algorithm presented in [Louppe, Kagan, and Cranmer2017].

Both demographic parity and equality of odds are referred to as group fairness metrics since they are concerned with encouraging an invariance of some property of a classifier over groups of a sensitive attribute. While straightforward to compute and reason about, optimizing for these metrics may produce models that are discriminatory over subgroups defined by combinations of sensitive attributes, constituting a form of fairness gerrymandering [Kearns et al.2017]. The competing notion of individual fairness [Dwork et al.2012] can address these concerns by assessing whether a model produces similar outputs for similar individuals, in a formalism similar to differential privacy. However, this notion is often of limited practical use due to the computational challenges associated with satisfying it. Recent efforts [Hébert-Johnson et al.2017, Kim, Ghorbani, and Zou2018] have investigated an alternative to both group and individual fairness metrics with a process that audits a classifier to discover subgroups for which the model is under-performing and iteratively improves model performance for those groups, ultimately resulting in a non-negative change in model performance for all computationally-identifiable subgroups.

The closest related work examining the fairness of risk prediction models in healthcare is [Chen, Johansson, and Sontag2018], which, in the context of mortality prediction in intensive care units, argued that any trade-off between model performance and fairness across subgroups is undesirable. They propose that the prediction error should be decomposed in terms of bias, variance, and noise and that the relative contribution of these terms be used to guide additional data collection.

Methods

Group       Count      ASCVD Incidence (%)   Follow-up Length (years)
Asian       30,294     2.3                   3.2
Black       8,549      3.0                   3.2
Hispanic    20,240     2.0                   2.9
Other       19,062     2.2                   3.1
Unknown     39,964     0.86                  3.1
White       135,438    2.8                   3.6
Female      149,594    1.9                   3.4
Male        103,953    2.9                   3.3
40-55       121,437    0.95                  3.4
55-65       61,214     2.1                   3.5
65-75       43,800     3.7                   3.2
75+         27,096     6.7                   3.0
All         253,547    2.8                   3.4

Table 1: Cohort characteristics. The number of patients extracted, the incidence of the ASCVD outcome, and the average length of follow-up for each subgroup are shown.

              Race                   Gender                   Age
              Standard   EQrace      Standard   EQgender      Standard   EQage
FNR, CV       0.126      0.1         0.102      0.0164        0.382      0.129
FPR, CV       0.538      0.383       0.45       0.12          1.05       0.205
Mean EMD      0.00749    0.00616     0.00875    0.0026        0.0239     0.00312
Mean EMD      0.0226     0.0237      0.0167     0.00593       0.0602     0.0209

Table 2: Distribution alignment metrics. We report the coefficient of variation (CV; the ratio of the standard deviation to the mean) of the false positive rate (FPR, CV) and false negative rate (FNR, CV) at a fixed decision threshold of 0.075 across the race, gender, and age groups. Furthermore, we compute the pairwise earth mover's distance (EMD) between distributions of the predicted probabilities of having an ASCVD event, conditioned on the true ASCVD label $y$, for each group of each sensitive attribute and take the mean; the two Mean EMD rows give one value for each value of the label.

The Dataset and Cohort Definition

We extract records from the Stanford Translational Research Integrated Database [Lowe et al.2009], a clinical data warehouse containing records on roughly three million patients from Stanford Hospital and Clinics and Lucile Packard Children’s Hospital for clinical encounters occurring between 1990 and 2017.

We define a prediction task that resembles the setting in which the PCEs were developed for the purpose of guiding physician decision-making in ASCVD prevention and construct a corresponding cohort. As a first step, we identify all patients with at least two clinical encounters over at least two years for which they are 40 years of age or older. Then, for each patient we select an index time uniformly at random from the interval that allows for at least one year of history and one year of follow-up. We exclude from the cohort patients that have a history of cardiovascular artery disease (including ASCVD and atrial fibrillation) or a prescription of an anti-hypertensive drug in the five years prior to the index time.
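The index-time selection can be sketched as follows. This is an illustration under stated assumptions, not the extraction code used for the cohort: the function name is hypothetical, times are handled at day resolution, one year is approximated as 365 days, and the age and exclusion criteria are not modeled.

```python
import random
from datetime import date, timedelta

ONE_YEAR = timedelta(days=365)  # assumption: one year approximated as 365 days


def sample_index_date(first_encounter, last_encounter, rng):
    """Sample an index date uniformly from the days that leave at least one
    year of history before it and one year of follow-up after it.

    Returns None when the record is too short to allow both windows.
    """
    earliest = first_encounter + ONE_YEAR
    latest = last_encounter - ONE_YEAR
    if earliest > latest:
        return None
    span_days = (latest - earliest).days
    # randint is inclusive on both ends, so every eligible day is equally likely.
    return earliest + timedelta(days=rng.randint(0, span_days))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
index_date = sample_index_date(date(2010, 1, 1), date(2015, 1, 1), rng)
too_short = sample_index_date(date(2010, 1, 1), date(2011, 6, 1), rng)
```

Sampling the index time at random, rather than fixing it at the first eligible encounter, means the amount of history and follow-up varies from patient to patient.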

Finally, we assign a positive ASCVD label for a patient if a diagnosis code for an ASCVD event is observed at any point in their record following the index time. The exclusion criteria (i.e., the list of cardiovascular-related diseases and medications) are provided as supplementary material, along with the list of clinical concepts used for defining ASCVD events. The patients are randomly partitioned such that 80%, 10%, and 10% are used for training, validation, and testing, respectively.
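A random 80/10/10 partition at the patient level might look like the sketch below. The function name and the seed are hypothetical, and the split is unstratified here because the text does not say whether any stratification was applied.

```python
import random


def split_patients(patient_ids, seed=0):
    """Shuffle patient identifiers and cut them into 80/10/10 partitions.

    Any remainder from integer truncation falls into the test partition.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test


# Hypothetical toy cohort of 1,000 patient identifiers.
train_ids, val_ids, test_ids = split_patients(range(1000))
```

Splitting by patient rather than by encounter keeps every record of a given patient inside a single partition.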

Sensitive Attributes

We consider race, gender, and age as sensitive attributes and assess model performance and fairness with respect to them. For race, we use both race and ethnicity variables to partition the cohort into six disjoint groups: Asian, Black, Hispanic, Other, Unknown, and White. Patients not considered Hispanic thus have either a non-Hispanic or unknown ethnicity. For gender, we partition the cohort into male and female populations. For age, we discretize the age at the index time into four disjoint groups: 40-55, 55-65, 65-75, and 75+ years, where the intervals are inclusive on the lower bound and exclusive on the upper bound. A summary of these groups is presented in Table 1.
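The age discretization, with intervals inclusive on the lower bound and exclusive on the upper bound, can be written as a short function. The function name is hypothetical, and rejecting ages under 40 is an assumption that follows from the cohort's age requirement.

```python
def age_group(age_at_index):
    """Map an age in years at the index time to one of the four age groups.

    Intervals are inclusive on the lower bound and exclusive on the upper bound.
    """
    if age_at_index < 40:
        raise ValueError("cohort is restricted to patients aged 40 or older")
    if age_at_index < 55:
        return "40-55"
    if age_at_index < 65:
        return "55-65"
    if age_at_index < 75:
        return "65-75"
    return "75+"


# A patient who is exactly 55 at the index time falls in the 55-65 group.
boundary_group = age_group(55)
```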

Feature Extraction

For feature extraction, we adopt a strategy similar to the one described in [Reps et al.2018] to convert time-stamped sequences of clinical concepts across several domains (i.e., diagnoses, procedures, medication orders, lab tests, clinical encounter types, departments, and other observations) into a static representation suitable for modeling. For each extracted patient, we filter the historical record to include only those concepts occurring prior to the index time. We encode each unique clinical concept observed in the dataset as a binary attribute indicating whether that concept was present anywhere in the patient's history prior to the prediction time; a concept that was not observed is treated as absent or missing. Similarly, we do not use the numeric results of lab tests or vital measurements, but only include the presence of their measurement. In all models, we include race, gender, and age as features regardless of whether the variable is treated as sensitive.
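The binary presence encoding can be sketched as below. The function names, the integer day offsets used as timestamps, and the concept codes are hypothetical stand-ins for the real clinical vocabularies, and a dense list is used only for readability; with tens of thousands of concepts a sparse representation would be the natural choice.

```python
def build_vocabulary(histories):
    """Map each unique concept seen in any history to a column index."""
    concepts = sorted({concept for events in histories for _, concept in events})
    return {concept: column for column, concept in enumerate(concepts)}


def encode_presence(events, index_time, vocabulary):
    """Return a 0/1 vector with a 1 for each concept seen before index_time.

    Repeated occurrences collapse to a single 1; numeric values are ignored.
    """
    row = [0] * len(vocabulary)
    for timestamp, concept in events:
        if timestamp < index_time and concept in vocabulary:
            row[vocabulary[concept]] = 1
    return row


# Hypothetical toy record: (day offset, concept code) pairs.
toy_history = [(10, "dx:E11"), (40, "lab:LDL"), (40, "lab:LDL"), (90, "rx:statin")]
toy_vocabulary = build_vocabulary([toy_history])
toy_features = encode_presence(toy_history, index_time=60, vocabulary=toy_vocabulary)
```

In the toy record, the medication order on day 90 falls after the index time of day 60, so its column stays at zero even though the concept exists in the vocabulary.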

Adversarial Learning for Equality of Odds

To develop an ASCVD risk prediction model that satisfies the definition of equality of odds in (2), we consider two fully-connected neural networks: a classifier $f$ parameterized by $\theta_f$ that predicts the probability of the ASCVD outcome $Y$ given data $X$; and an adversary $g$ parameterized by $\theta_g$ that takes as input both the logit of the output of $f$ and the value of the true label $Y$ to predict a distribution over the groups of a sensitive attribute $Z$. If $L_f$ and $L_g$ are the cross-entropy losses of the classifier predictions over $Y$ and the adversary predictions over $Z$, respectively, then the training procedure may be described by alternating between the steps

(3)
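Alternating adversarial updates of this kind are commonly written in the form below. This is offered as a sketch of the usual formulation rather than a verbatim statement of (3): the symbols $\theta_f$, $\theta_g$, $L_f$, and $L_g$ denote the classifier and adversary parameters and their cross-entropy losses, and the trade-off weight $\lambda$ is an assumed hyperparameter that may or may not appear in the original equation.

```latex
% Adversary step: improve the adversary's prediction of the sensitive attribute.
\min_{\theta_g} \; L_g(\theta_f, \theta_g)

% Classifier step: predict the outcome well while making the adversary's task harder.
\min_{\theta_f} \; L_f(\theta_f) - \lambda \, L_g(\theta_f, \theta_g)
```

Under this form, the classifier is rewarded when the adversary cannot recover the group from the prediction logit and the true label, which is the conditional independence that equality of odds asks for.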

Model Training and Evaluation

The training procedure is composed of four experiments and thus produces four prediction models. The first model is trained to predict the raw risk of ASCVD and does not use adversarial training. The other three models result from separate training runs in which each of the discrete race, gender, and age variables is, in turn, treated as the sensitive attribute in the adversarial training procedure. We refer to these four experiments as Standard, EQrace, EQgender, and EQage.

For all experiments, we employ fully-connected feedforward neural networks with a fixed set of hyperparameters. The ASCVD prediction model is composed of the sum over an embedding layer of dimension followed by two hidden layers of dimension with batch normalization [Ioffe and Szegedy2015] and leaky ReLU nonlinearities [Maas, Hannun, and Ng2013]. The adversarial network maintains a similar architecture, but with one hidden layer of dimension 64 and takes the prediction logit and ASCVD outcome as inputs. Training proceeds in a batch setting with the Adam optimizer [Kingma and Ba2014] with learning rate , , and with batch size over the training set and early stopping based on the area under the receiver operating characteristic curve (AUC-ROC) for ASCVD prediction in the validation set. All training was performed on a single GPU with the PyTorch library [Paszke et al.2017].

For each model, we compute standard metrics on the entire test set and on each subgroup. Specifically, we report the AUC-ROC, the area under the precision-recall curve (AUC-PRC), the Brier score [Brier1950] as a measure of calibration, and the false positive and false negative rates (FPR, FNR) at a fixed threshold of 0.075, in keeping with current ASCVD guidelines for the prescription of statin therapy [Stone et al.2014, Yadlowsky et al.2018]. To express adherence to the standard equality of odds definition in equation (1), we report the coefficient of variation (i.e., the ratio of the standard deviation to the mean) of the FPR and FNR at this threshold across the groups of each sensitive attribute. To assess the distance between the distributions presented in (2), we compute the earth mover's distance (EMD, or first Wasserstein distance) [Ramdas, Trillos, and Cuturi2017] between the empirical distributions of the predicted probability of ASCVD, conditioned on whether ASCVD occurred or not, for each group of each sensitive attribute in a pairwise fashion and take the mean within each stratum.
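The two alignment metrics can be sketched with the Python standard library alone. The function names are hypothetical. Using the sample standard deviation is an assumption: applied to the female and male Standard FNRs listed in Table 4 (0.684 and 0.592), it gives a value that rounds to the 0.102 shown in Table 2, which is suggestive but does not confirm the convention actually used. The EMD is computed as the area between the two empirical cumulative distribution functions, which equals the first Wasserstein distance in one dimension.

```python
import bisect
import statistics
from itertools import combinations


def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def emd_1d(a, b):
    """First Wasserstein distance between two empirical 1-D distributions,
    computed as the integral of |CDF_a - CDF_b| over the pooled support."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect.bisect_right(a, left) / len(a)
        cdf_b = bisect.bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def mean_pairwise_emd(scores_by_group):
    """Average EMD over all unordered pairs of groups, for one outcome stratum."""
    pairs = list(combinations(scores_by_group.values(), 2))
    return sum(emd_1d(a, b) for a, b in pairs) / len(pairs)


# Standard-model FNRs for the female and male groups, taken from Table 4.
cv_fnr_gender = coefficient_of_variation([0.684, 0.592])
```

To follow the evaluation described above, `mean_pairwise_emd` would be called twice per sensitive attribute: once with the predicted probabilities of patients who had an ASCVD event and once with those of patients who did not.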

Results

Figure 1: Empirical distribution of the predicted probability of developing ASCVD in the followup period conditioned on whether ASCVD occurred. Plots are stratified by experimental condition (Standard or EQ), true value of the ASCVD outcome (y = 0 or y = 1), and the variable treated as sensitive (race, gender, or age).

               Standard   EQrace    EQgender   EQage
AUC-ROC        0.793      0.772     0.779      0.743
AUC-PRC        0.133      0.125     0.13       0.0965
Brier Score    0.0205     0.0207    0.0206     0.0211

Table 3: Model performance measured on the test set without stratification for each experimental condition.

            AUC-ROC           AUC-PRC           Brier Score         FNR               FPR
Group       Stand.   EQ       Stand.   EQ       Stand.    EQ        Stand.   EQ       Stand.    EQ
Asian       0.819    0.771    0.138    0.155    0.0196    0.0197    0.683    0.587    0.0388    0.0903
Black       0.753    0.756    0.162    0.2      0.034     0.0338    0.621    0.69     0.0781    0.0437
Hispanic    0.811    0.803    0.117    0.0816   0.0142    0.015     0.667    0.6      0.0391    0.0945
Other       0.813    0.822    0.13     0.113    0.0217    0.0219    0.711    0.556    0.0544    0.102
Unknown     0.713    0.718    0.0619   0.0406   0.00766   0.00812   0.844    0.719    0.00944   0.0353
White       0.774    0.766    0.146    0.155    0.0245    0.0245    0.6      0.619    0.0804    0.0714
Female      0.8      0.786    0.12     0.122    0.0173    0.0174    0.684    0.625    0.0423    0.0567
Male        0.775    0.769    0.148    0.143    0.0249    0.025     0.592    0.64     0.0818    0.0672
40-55       0.713    0.727    0.0404   0.0275   0.0085    0.00922   0.952    0.817    0.00683   0.0573
55-65       0.736    0.708    0.0919   0.0676   0.0195    0.0198    0.794    0.746    0.0409    0.0618
65-75       0.736    0.739    0.128    0.141    0.0349    0.0347    0.608    0.669    0.115     0.088
75+         0.776    0.763    0.228    0.224    0.053     0.0548    0.351    0.607    0.251     0.0806

Table 4: Model performance measured on the test set stratified by group and experimental condition. EQ corresponds to training for the sensitive attribute corresponding to the subgroup of interest. FPR and FNR are computed at a fixed decision threshold of 0.075.

Cohort Characteristics

The cohort extraction procedure produces a cohort of 253,547 patients having 71,554 features, with 5,886 patients labeled as positive for ASCVD (Table 1). We note that in this cohort, there are 135,438 white patients, constituting a majority, and 8,549 black patients. Across racial groups, ASCVD rates range from 2.0-3.0%, with the exception of patients with unknown race, who experience a reduced rate of 0.86%. Furthermore, we observe higher ASCVD rates for male patients compared to female patients. Finally, ASCVD rates appear to increase monotonically with age, with rates ranging from 0.95% for the 40-55 age group to 6.7% for patients age 75 or older.

Distribution Alignment with Adversarial Training

Applying the adversarial training procedure results in an alignment of the distributions of the predicted probability of ASCVD conditioned on the true outcome label (Figure 1). Without employing an adversary, the center of mass of these distributions appears to depend significantly on the base ASCVD rate in the group. However, these differences largely disappear when training in an adversarial setting. This results in a substantial reduction in the mean pairwise EMD between each predictive distribution in both outcome strata for both gender and age, with a negligible effect for race (Table 2). Furthermore, we note that variability in the FPRs and FNRs at a fixed threshold of 0.075 is greatly reduced following adversarial training (Table 2), indicating that the approach is successful at producing a model satisfying the equality of odds metric.

The relative lack of success in minimizing the mean pairwise EMD between the conditional predictive distributions across racial groups (Table 2) may be largely explained by the anomalous characteristics of the group of patients having unknown race. For instance, when using standard training (Standard), the predictive distribution conditioned on a positive ASCVD outcome for the unknown race group is clearly separated from those of the other five groups, while the distributions for those five are mostly aligned (Figure 1). However, when training the model in an adversarial setting, it appears that the primary effect is to align the predictive distribution for the unknown race group to the region inhabited by the distributions of the remaining groups while disturbing the relative alignment between the distributions for those groups.

The Cost of Fairness

Satisfying equality of odds with an adversarial objective incurs a reduction in AUC-ROC, AUC-PRC, and calibration for the population at large (Table 3), with the largest negative effects observed when training to adjust for the differences across age groups (Standard AUC-ROC = 0.793 vs. EQage AUC-ROC = 0.743). However, for ranking metrics such as the AUC-ROC, the effects can be unintuitive following an adjustment of the subgroup predictive distributions. For instance, the adversarial training procedure for age actually leads to an increase in AUC-ROC for the majority 40-55 years group (Standard AUC-ROC = 0.713 vs. EQage AUC-ROC = 0.727) (Table 4) despite the stark decline in the AUC-ROC observed on the population as a whole. Furthermore, several of the populations assessed experience a reduction in performance for some metrics with improvements in others following training for equality of odds. In other cases, the effect is largely positive. Notably, model performance improves on all metrics except for the fixed threshold FNR for the black population, a group for which the model attains the lowest AUC-ROC (0.753) and is the least well-calibrated (Brier Score = 0.034) for the standard setting.

It has been shown that developing a well-calibrated model is an objective that conflicts with that of satisfying equality of odds [Pleiss et al.2017, Kleinberg, Mullainathan, and Raghavan2016, Chouldechova2017]. In our case, we observed such a trade-off, but judged it to be minor because the Brier score increased only slightly for almost every subgroup following training for equality of odds (Table 4).
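For reference, the Brier score is the mean squared difference between the predicted probability and the observed binary outcome, so lower values are better. A minimal sketch, with a hypothetical function name and toy inputs:

```python
def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    squared_errors = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical toy example: two confident, correct predictions.
toy_brier = brier_score([0.1, 0.9], [0, 1])
```

Because the score depends heavily on the base rate of the outcome, values are easier to compare between models on the same group than between groups with different ASCVD incidence.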

Discussion

We have demonstrated empirically that adversarial learning procedures produce models whose predictions satisfy equality of odds. These models thus produce predictions that are more fair for under-represented patient subgroups. However, the equality of odds comes at the cost of a reduction in AUC-ROC and AUC-PRC for most subgroups.

Limitations of the Predictive Model

While using EHR data allowed a high-capacity ASCVD risk prediction model to be trained using a large and diverse cohort, this model should not be directly compared to the PCEs for several reasons. The PCEs estimate ten-year ASCVD risk, whereas our model estimates risk over a period of at least a year. Furthermore, we cannot rule out the existence of biases that may lead to differential rates of selection into our cohort across age-, gender-, and race-based subgroups, nor can we establish whether the nature of these biases differs from that of the biases present in the prospective cohort studies used to derive the PCEs.

Moving Beyond Equality of Odds

While we have demonstrated empirically that adversarial learning procedures are capable of making a model satisfy equality of odds, the use of this metric as a notion of fairness should be approached with caution. Satisfying equality of odds at a fixed decision threshold is effectively a post-hoc correction to the model that randomizes some predictions for those groups for which the model performs well [Hardt, Price, and Srebro2016, Pleiss et al.2017], thus withholding potentially beneficial risk estimates. In the adversarial training setting, this reduction in performance may be offset by effective transfer learning to populations for which the model performs poorly. However, we observed that if such a benefit from transfer learning exists, it is smaller than the reduction in performance incurred for most groups.

The approach proposed in [Hébert-Johnson et al.2017, Kim, Ghorbani, and Zou2018], which allows for non-negative changes in utility for all subgroups, including those not explicitly specified as sensitive, is a promising future direction for clinical risk prediction models. Our current approach only considers the fairness of the model over the subgroups resulting from a single sensitive attribute, without consideration of subgroups defined via combinations of these attributes (e.g., non-white females) or of implicit subgroups not considered sensitive.

We have not examined the relationship between the errors of the predictive model and notions of long-term utility when deploying the model clinically. Properly analyzing the effect of these errors on utility requires careful modeling of the sequential decision-making process following ASCVD risk prediction while accounting for individual patient characteristics. We emphasize that while such a process is crucial to evaluate the long-term impact of any prediction model, it is not possible to properly examine that decision-making process with observational data in the EHR alone [Hardt, Price, and Srebro2016, Kilbertus et al.2017]. Additionally, it is unclear whether satisfying fairness constraints for a single-step decision, as in ASCVD risk prediction, aligns with the goal of equitably maximizing long-term utility, as it has been shown that satisfying fairness constraints for a static decision may actually cause long-term harm in settings where an unconstrained objective would not [Liu et al.2018].

Conclusion

Existing approaches to ASCVD risk scoring perform poorly for the population at large, with more extreme risk mis-estimates for minority populations, inadvertently exposing those groups to excess harm. We develop an ASCVD prediction model using EHR data and show that we can encourage formal notions of fairness by reducing the variability in the FPR and FNR across groups. It is not yet known to what extent algorithmic notions of fairness align with other goals, including long-term utility maximization. We hope that our results will serve as an impetus for the community at large to investigate the fairness-utility trade-off during sequential clinical decision making resulting from fairness constraints imposed on clinical risk assessments.

Acknowledgements

We would like to thank Sam Corbett-Davies for early advice and insightful discussion. We thank Julia Daniels and Sebastian Le Bras for their contributions to an earlier version of this project.

References

  • [Asaria et al.2016] Asaria, M.; Walker, S.; Palmer, S.; Gale, C. P.; Shah, A. D.; Abrams, K. R.; Crowther, M.; Manca, A.; Timmis, A.; Hemingway, H.; and Sculpher, M. 2016. Using electronic health records to predict costs and outcomes in stable coronary artery disease. Heart 102(10):755–762.
  • [Benjamin et al.2018] Benjamin, E. J.; Virani, S. S.; Callaway, C. W.; Chang, A. R.; Cheng, S.; Chiuve, S. E.; Cushman, M.; Delling, F. N.; Deo, R.; de Ferranti, S. D.; Ferguson, J. F.; Fornage, M.; Gillespie, C.; Isasi, C. R.; Jiménez, M. C.; Jordan, L. C.; Judd, S. E.; Lackland, D.; Lichtman, J. H.; Lisabeth, L.; Liu, S.; Longenecker, C. T.; Lutsey, P. L.; Matchar, D. B.; Matsushita, K.; Mussolino, M. E.; Nasir, K.; O’Flaherty, M.; Palaniappan, L. P.; Pandey, D. K.; Reeves, M. J.; Ritchey, M. D.; Rodriguez, C. J.; Roth, G. A.; Rosamond, W. D.; Sampson, U. K.; Satou, G. M.; Shah, S. H.; Spartano, N. L.; Tirschwell, D. L.; Tsao, C. W.; Voeks, J. H.; Willey, J. Z.; Wilkins, J. T.; Wu, J. H.; Alger, H. M.; Wong, S. S.; and Muntner, P. 2018. Heart Disease and Stroke Statistics—2018 Update: A Report From the American Heart Association. Circulation 137(12):CIR.0000000000000558.
  • [Beutel et al.2017] Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. H. 2017. Data decisions and theoretical implications when adversarially learning fair representations. arXiv preprint arXiv:1707.00075.
  • [Brier1950] Brier, G. W. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78(1):1–3.
  • [Char, Shah, and Magnus2018] Char, D. S.; Shah, N. H.; and Magnus, D. 2018. Implementing Machine Learning in Health Care — Addressing Ethical Challenges. New England Journal of Medicine 378(11):981–983.
  • [Chen, Johansson, and Sontag2018] Chen, I.; Johansson, F. D.; and Sontag, D. 2018. Why is my classifier discriminatory? In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18.
  • [Chouldechova2017] Chouldechova, A. 2017. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. ArXiv e-prints.
  • [Cohen et al.2014] Cohen, I. G.; Amarasingham, R.; Shah, A.; Xie, B.; and Lo, B. 2014. The Legal And Ethical Concerns That Arise From Using Complex Predictive Analytics In Health Care. Health Affairs 33(7):1139–1147.
  • [De Filippis et al.2017] De Filippis, A. P.; Young, R.; McEvoy, J. W.; Michos, E. D.; Sandfort, V.; Kronmal, R. A.; McClelland, R. L.; and Blaha, M. J. 2017. Risk score overestimation: The impact of individual cardiovascular risk factors and preventive therapies on the performance of the American Heart Association-American College of Cardiology-Atherosclerotic Cardiovascular Disease risk score in a modern mult. European Heart Journal 38(8):598–608.
  • [Dwork et al.2012] Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; and Zemel, R. 2012. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, 214–226. ACM.
  • [Edwards and Storkey2015] Edwards, H., and Storkey, A. 2015. Censoring Representations with an Adversary. arXiv preprint arXiv:1511.05897.
  • [Ganin et al.2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V.; Dogan, U.; Kloft, M.; Orabona, F.; and Tommasi, T. 2016. Domain-Adversarial Training of Neural Networks. Journal of Machine Learning Research 17:1–35.
  • [Goff et al.2013] Goff, D. C.; Lloyd-Jones, D. M.; Bennett, G.; Coady, S.; D’Agostino, R. B.; Gibbons, R.; Greenland, P.; Lackland, D. T.; Levy, D.; O’Donnell, C. J.; Robinson, J. G.; Schwartz, J. S.; Shero, S. T.; Smith, S. C.; Sorlie, P.; Stone, N. J.; and Wilson, P. W. F. 2013. 2013 ACC/AHA Guideline on the Assessment of Cardiovascular Risk. Circulation 129(25):S49–S73.
  • [Goldstein et al.2017] Goldstein, B. A.; Navar, A. M.; Pencina, M. J.; and Ioannidis, J. P. A. 2017. Opportunities and challenges in developing risk prediction models with electronic health records data: A systematic review. Journal of the American Medical Informatics Association 24(1):198–208.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative Adversarial Nets. Advances in Neural Information Processing Systems 27:2672–2680.
  • [Hardt, Price, and Srebro2016] Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of Opportunity in Supervised Learning. Advances in Neural Information Processing Systems 3315–3323.
  • [Hébert-Johnson et al.2017] Hébert-Johnson, Ú.; Kim, M. P.; Reingold, O.; and Rothblum, G. N. 2017. Calibration for the (Computationally-Identifiable) Masses. In Dy, J., and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 1939–1948. Stockholmsmässan, Stockholm Sweden: PMLR.
  • [Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37, 448–456. JMLR.org.
  • [Jung et al.2015] Jung, K. J.; Jang, Y.; Oh, D. J.; Oh, B.-H.; Lee, S. H.; Park, S.-W.; Seung, K.-B.; Kim, H.-K.; Yun, Y. D.; Choi, S. H.; Sung, J.; Lee, T.-Y.; hi Kim, S.; Koh, S. B.; Kim, M. C.; Chang Kim, H.; Kimm, H.; Nam, C.; Park, S.; and Jee, S. H. 2015. The ACC/AHA 2013 pooled cohort equations compared to a Korean Risk Prediction Model for atherosclerotic cardiovascular disease. Atherosclerosis 242(1):367–375.
  • [Kearns et al.2017] Kearns, M.; Neel, S.; Roth, A.; and Wu, Z. S. 2017. Preventing Fairness Gerrymandering: Auditing and Learning for Subgroup Fairness. arXiv preprint arXiv:1711.05144.
  • [Kilbertus et al.2017] Kilbertus, N.; Carulla, M. R.; Parascandolo, G.; Hardt, M.; Janzing, D.; and Schölkopf, B. 2017. Avoiding discrimination through causal reasoning. In Advances in Neural Information Processing Systems, 656–666.
  • [Kim, Ghorbani, and Zou2018] Kim, M. P.; Ghorbani, A.; and Zou, J. 2018. Multiaccuracy: Black-Box Post-Processing for Fairness in Classification. arXiv preprint arXiv:1805.12317.
  • [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Kleinberg, Mullainathan, and Raghavan2016] Kleinberg, J.; Mullainathan, S.; and Raghavan, M. 2016. Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv preprint arXiv:1609.05807.
  • [Liu et al.2018] Liu, L. T.; Dean, S.; Rolf, E.; Simchowitz, M.; and Hardt, M. 2018. Delayed Impact of Fair Machine Learning. In Proceedings of the 35th International Conference on Machine Learning.
  • [Louppe, Kagan, and Cranmer2017] Louppe, G.; Kagan, M.; and Cranmer, K. 2017. Learning to pivot with adversarial networks. In Advances in Neural Information Processing Systems, 981–990.
  • [Lowe et al.2009] Lowe, H. J.; Ferris, T. A.; Hernandez, P. M.; and Weber, S. C. 2009. STRIDE–An integrated standards-based translational research informatics platform. AMIA … Annual Symposium proceedings. AMIA Symposium 2009:391–5.
  • [Maas, Hannun, and Ng2013] Maas, A. L.; Hannun, A. Y.; and Ng, A. Y. 2013. Rectifier Nonlinearities Improve Neural Network Acoustic Models. Proceedings of the 30th International Conference on Machine Learning 28:6.
  • [Madras et al.2018] Madras, D.; Creager, E.; Pitassi, T.; and Zemel, R. 2018. Learning Adversarially Fair and Transferable Representations. arXiv preprint arXiv:1802.06309.
  • [Mohamed and Lakshminarayanan2016] Mohamed, S., and Lakshminarayanan, B. 2016. Learning in implicit generative models. arXiv preprint arXiv:1610.03483.
  • [Mora et al.2018] Mora, S.; Wenger, N. K.; Cook, N. R.; Liu, J.; Howard, B. V.; Limacher, M. C.; Liu, S.; Margolis, K. L.; Martin, L. W.; Paynter, N. P.; Ridker, P. M.; Robinson, J. G.; Rossouw, J. E.; Safford, M. M.; and Manson, J. E. 2018. Evaluation of the Pooled Cohort Risk Equations for Cardiovascular Risk Prediction in a Multiethnic Cohort From the Women’s Health Initiative. JAMA Internal Medicine.
  • [Paszke et al.2017] Paszke, A.; Gross, S.; Chintala, S.; Chanan, G.; Yang, E.; DeVito, Z.; Lin, Z.; Desmaison, A.; Antiga, L.; and Lerer, A. 2017. Automatic differentiation in PyTorch.
  • [Pike et al.2016] Pike, M. M.; Decker, P. A.; Larson, N. B.; St. Sauver, J. L.; Takahashi, P. Y.; Roger, V. L.; Rocca, W. A.; Miller, V. M.; Olson, J. E.; Pathak, J.; and Bielinski, S. J. 2016. Improvement in Cardiovascular Risk Prediction with Electronic Health Records. Journal of Cardiovascular Translational Research 9(3):214–222.
  • [Pleiss et al.2017] Pleiss, G.; Raghavan, M.; Wu, F.; Kleinberg, J.; and Weinberger, K. Q. 2017. On fairness and calibration. In Advances in Neural Information Processing Systems, 5680–5689.
  • [Pylypchuk et al.2018] Pylypchuk, R.; Wells, S.; Kerr, A.; Poppe, K.; Riddell, T.; Harwood, M.; Exeter, D.; Mehta, S.; Grey, C.; and Wu, B. P. 2018. Cardiovascular disease risk prediction equations in 400,000 primary care patients in New Zealand: a derivation and validation study. The Lancet 391(10133):1897–1907.
  • [Ramdas, Trillos, and Cuturi2017] Ramdas, A.; Trillos, N. G.; and Cuturi, M. 2017. On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2).
  • [Rana et al.2016] Rana, J. S.; Tabada, G. H.; Solomon, M. D.; Lo, J. C.; Jaffe, M. G.; Sung, S. H.; Ballantyne, C. M.; and Go, A. S. 2016. Accuracy of the Atherosclerotic Cardiovascular Risk Equation in a Large Contemporary, Multiethnic Population. Journal of the American College of Cardiology 67(18):2118–2130.
  • [Rapsomaniki et al.2014] Rapsomaniki, E.; Shah, A.; Perel, P.; Denaxas, S.; George, J.; Nicholas, O.; Udumyan, R.; Feder, G. S.; Hingorani, A. D.; Timmis, A.; Smeeth, L.; and Hemingway, H. 2014. Prognostic models for stable coronary artery disease based on electronic health record cohort of 102 023 patients. European Heart Journal 35(13):844–852.
  • [Reps et al.2018] Reps, J. M.; Schuemie, M. J.; Suchard, M. A.; Ryan, P. B.; and Rijnbeek, P. R. 2018. Design and implementation of a standardized framework to generate and evaluate patient-level prediction models using observational healthcare data. Journal of the American Medical Informatics Association.
  • [Stone et al.2014] Stone, N. J.; Robinson, J. G.; Lichtenstein, A. H.; Bairey Merz, C. N.; Blum, C. B.; Eckel, R. H.; Goldberg, A. C.; Gordon, D.; Levy, D.; Lloyd-Jones, D. M.; McBride, P.; Schwartz, J. S.; Shero, S. T.; Smith, S. C.; Watson, K.; and Wilson, P. W. F. 2014. 2013 ACC/AHA Guideline on the Treatment of Blood Cholesterol to Reduce Atherosclerotic Cardiovascular Risk in Adults. Circulation 129(25 suppl 2):S1–S45.
  • [Uehara et al.2016] Uehara, M.; Sato, I.; Suzuki, M.; Nakayama, K.; and Matsuo, Y. 2016. Generative adversarial nets from a density ratio estimation perspective. arXiv preprint arXiv:1610.02920.
  • [Verghese, Shah, and Harrington2018] Verghese, A.; Shah, N. H.; and Harrington, R. A. 2018. What this computer needs is a physician: Humanism and artificial intelligence. JAMA 319(1):19–20.
  • [Xiao, Choi, and Sun2018] Xiao, C.; Choi, E.; and Sun, J. 2018. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association.
  • [Yadlowsky et al.2018] Yadlowsky, S.; Hayward, R. A.; Sussman, J. B.; McClelland, R. L.; Min, Y.-I.; and Basu, S. 2018. Clinical Implications of Revised Pooled Cohort Equations for Estimating Atherosclerotic Cardiovascular Disease Risk. Annals of Internal Medicine 169(1):20.
  • [Zafar et al.2017] Zafar, M. B.; Valera, I.; Rodriguez, M. G.; Gummadi, K. P.; and Weller, A. 2017. From Parity to Preference-based Notions of Fairness in Classification. Advances in Neural Information Processing Systems 229–239.
  • [Zemel et al.2013] Zemel, R. S.; Wu, Y.; Swersky, K.; Pitassi, T.; and Dwork, C. 2013. Learning Fair Representations. Proceedings of the 30th International Conference on Machine Learning 28:325–333.
  • [Zhang, Lemoine, and Mitchell2018] Zhang, B. H.; Lemoine, B.; and Mitchell, M. 2018. Mitigating unwanted biases with adversarial learning. arXiv preprint arXiv:1801.07593.

Supplementary Material

Exclusion criteria: Cardiovascular Artery Diseases

Concept id Label
197303 Monoplegia of dominant lower limb as a late effect of cerebrovascular accident
312327 Acute myocardial infarction
312938 Hypertensive encephalopathy
313217 Atrial fibrillation
314667 Nonpyogenic thrombosis of intracranial venous sinus
315286 Chronic ischemic heart disease
315296 Preinfarction syndrome
315832 Angina decubitus
316139 Heart failure
316427 Aneurysm of coronary vessels
316437 Cerebral atherosclerosis
316995 Coronary occlusion
317576 Coronary arteriosclerosis
319038 Postmyocardial infarction syndrome
319835 Congestive heart failure
321318 Angina pectoris
321879 Dissecting aneurysm of coronary artery
372654 Paralytic syndrome as late effect of stroke
372924 Cerebral artery occlusion
373503 Transient cerebral ischemia
374055 Basilar artery syndrome
374060 Acute ill-defined cerebrovascular disease
374384 Cerebral ischemia
375557 Cerebral embolism
376714 Vertebrobasilar artery syndrome
378774 Moyamoya disease
381591 Cerebrovascular disease
433505 Subclavian steal syndrome
434056 Late effects of cerebrovascular disease
434376 Acute myocardial infarction of anterior wall
434656 Vertebral artery syndrome
434657 Weakness of face muscles
436706 Acute myocardial infarction of lateral wall
437308 Basilar artery occlusion
437584 Ataxia
438168 Aneurysm of heart
438170 Acute myocardial infarction of inferior wall
438438 Acute myocardial infarction of anterolateral wall
438447 Acute myocardial infarction of inferolateral wall
439295 Multiple and bilateral precerebral arterial occlusion
439693 True posterior myocardial infarction
439846 Left heart failure
440426 Vertigo as late effect of stroke
441579 Acute myocardial infarction of inferoposterior wall
441874 Cerebral thrombosis
443239 Precerebral arterial occlusion
443465 Dysphagia as a late effect of cerebrovascular accident
443525 Monoplegia of dominant upper limb as a late effect of cerebrovascular accident
443551 Apraxia due to cerebrovascular accident
443563 Arteriosclerosis of coronary artery bypass graft
443580 Systolic heart failure
443587 Diastolic heart failure
443599 Paralytic syndrome of nondominant side as late effect of stroke
443609 Paralytic syndrome of dominant side as late effect of stroke
444406 Acute subendocardial infarction
4043731 Infarction - precerebral
4048785 Vertebrobasilar territory transient ischemic attack
4108356 Cerebral infarction due to embolism of cerebral arteries
4108669 Acute myocardial infarction of atrium
4110192 Cerebral infarction due to thrombosis of cerebral arteries
4162038 Occlusion of artery
4185117 Vertebral artery obstruction
4185932 Ischemic heart disease
4186397 Myocardial ischemia
4288310 Carotid artery obstruction
40479192 Chronic systolic heart failure
40479575 Dysphasia as late effect of cerebrovascular disease
40479576 Chronic diastolic heart failure
40480002 Aphasia as late effect of cerebrovascular disease
40480449 Sensory disorder as a late effect of cerebrovascular disease
40480602 Acute on chronic systolic heart failure
40480603 Acute systolic heart failure
40480938 Monoplegia of lower limb as late effect of cerebrovascular disease
40480946 Monoplegia of nondominant lower limb as a late effect of cerebrovascular accident
40481042 Acute diastolic heart failure
40481043 Acute on chronic diastolic heart failure
40481132 Arteriosclerosis of coronary artery bypass graft of transplanted heart
40481354 Speech and language deficit as late effect of cerebrovascular accident
40481762 Hemiplegia as late effect of cerebrovascular disease
40481842 Monoplegia of upper limb as late effect of cerebrovascular disease
40481919 Coronary atherosclerosis
40482266 Monoplegia of nondominant upper limb as a late effect of cerebrovascular accident
40482301 Residual cognitive deficit as late effect of cerebrovascular accident
40482638 Arteriosclerosis of autologous vein coronary artery bypass graft
40482655 Arteriosclerosis of nonautologous coronary artery bypass graft
40482727 Combined systolic and diastolic dysfunction
40483189 Arteriosclerosis of arterial coronary artery bypass graft
40484513 Hemiplegia of nondominant side as late effect of cerebrovascular disease
40484522 Hemiplegia of dominant side as late effect of cerebrovascular disease
42872402 Coronary arteriosclerosis in native artery
43021821 Coronary arteriosclerosis in native artery of transplanted heart
43530687 Dysarthria as late effects of cerebrovascular disease
43531583 Visual disturbance as sequela of cerebrovascular disease
44782718 Acute combined systolic and diastolic heart failure
44782719 Chronic combined systolic and diastolic heart failure
44782733 Acute on chronic combined systolic and diastolic heart failure
Table S1: List of concepts from the OMOP (Observational Medical Outcomes Partnership) vocabulary listed as cardiovascular artery diseases

Exclusion criteria: Antihypertensive drugs

Concept id Label
C02AA Rauwolfia alkaloids
C02AA01 rescinnamine
C02AA02 reserpine
C02AA05 deserpidine
C02AA06 methoserpidine
C02AB01 methyldopa (levorotatory)
C02AC01 clonidine
C02AC02 guanfacine
C02AC05 moxonidine
C02AC06 rilmenidine
C02BA01 trimetaphan
C02BB01 mecamylamine
C02CA01 prazosin
C02CA02 indoramin
C02CA04 doxazosin
C02CA06 urapidil
C02CC01 betanidine
C02CC02 guanethidine
C02CC04 debrisoquine
C02DA01 diazoxide
C02DB01 dihydralazine
C02DB02 hydralazine
C02DC01 minoxidil
C02DD01 nitroprusside
C02DG01 pinacidil
C02KB01 metirosine
C02KC01 pargyline
C02KD01 ketanserin
C02KX01 bosentan
C02KX02 ambrisentan
C02KX04 macitentan
C02KX05 riociguat
C07AA01 alprenolol
C07AA02 oxprenolol
C07AA03 pindolol
C07AA05 propranolol
C07AA06 timolol
C07AA07 sotalol
C07AA12 nadolol
C07AA14 mepindolol
C07AA15 carteolol
C07AA16 tertatolol
C07AA17 bopindolol
C07AA19 bupranolol
C07AA23 penbutolol
C07AB01 practolol
C07AB02 metoprolol
C07AB03 atenolol
C07AB04 acebutolol
C07AB05 betaxolol
C07AB07 bisoprolol
C07AB08 celiprolol
C07AB09 esmolol
C07AB12 nebivolol
C07AB13 talinolol
C07AG01 labetalol
C07AG02 carvedilol
C07FB02 metoprolol and felodipine
C07FB03 atenolol and nifedipine
Table S2: List of drugs from the ATC (Anatomical Therapeutic Chemical Classification System) vocabulary considered antihypertensive drugs.

Outcome Definition: ASCVD

Concept id Label
312327 Acute myocardial infarction
372924 Cerebral artery occlusion
374060 Acute ill-defined cerebrovascular disease
375557 Cerebral embolism
434376 Acute myocardial infarction of anterior wall
436706 Acute myocardial infarction of lateral wall
437308 Basilar artery occlusion
438170 Acute myocardial infarction of inferior wall
438438 Acute myocardial infarction of anterolateral wall
438447 Acute myocardial infarction of inferolateral wall
439295 Multiple and bilateral precerebral arterial occlusion
439693 True posterior myocardial infarction
441579 Acute myocardial infarction of inferoposterior wall
441874 Cerebral thrombosis
443239 Precerebral arterial occlusion
444406 Acute subendocardial infarction
4043731 Infarction - precerebral
4108356 Cerebral infarction due to embolism of cerebral arteries
4108669 Acute myocardial infarction of atrium
4110192 Cerebral infarction due to thrombosis of cerebral arteries
4185117 Vertebral artery obstruction
4288310 Carotid artery obstruction
Table S3: List of concepts from the OMOP (Observational Medical Outcomes Partnership) vocabulary listed as composing ASCVD
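
As a minimal sketch of how concept sets like those in Tables S1 and S3 might be applied, the following Python snippet drops patients with a prior exclusion diagnosis and labels the remaining patients by whether an outcome concept appears later. The record layout, the helper name `label_patients`, and the index-day convention are assumptions made for illustration and are not taken from the cohort pipeline used in this work. Only a handful of concept ids from the tables are included.

```python
from collections import defaultdict

# A few concept ids copied from Table S1 (exclusion) and Table S3 (outcome).
# The full tables contain many more entries.
EXCLUSION_CONCEPTS = {312327, 313217, 316139, 372924, 441874}
OUTCOME_CONCEPTS = {312327, 372924, 441874}


def label_patients(records, index_day=0):
    """Hypothetical helper; records are (patient_id, concept_id, day) tuples.

    A patient is dropped if any exclusion concept is recorded on or before
    index_day. Remaining patients are labeled 1 if any outcome concept is
    recorded after index_day, and 0 otherwise.
    """
    by_patient = defaultdict(list)
    for patient_id, concept_id, day in records:
        by_patient[patient_id].append((concept_id, day))

    labels = {}
    for patient_id, events in by_patient.items():
        excluded = any(
            c in EXCLUSION_CONCEPTS and d <= index_day for c, d in events
        )
        if excluded:
            continue
        labels[patient_id] = int(
            any(c in OUTCOME_CONCEPTS and d > index_day for c, d in events)
        )
    return labels


# Toy records (invented for illustration, not real patient data).
toy_records = [
    ("a", 313217, -30),  # atrial fibrillation before index: excluded
    ("b", 312327, 400),  # acute myocardial infarction after index: outcome
    ("c", 999999, 10),   # made-up unrelated concept id: no outcome
]
print(label_patients(toy_records))
```

Every outcome concept in Table S3 also appears in Table S1, so under this convention a patient with an ASCVD event before the index day is excluded rather than labeled.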