Chasing Your Long Tails: Differentially Private Prediction in Health Care Settings

10/13/2020 ∙ by Vinith M. Suriyakumar, et al. ∙ University of Toronto

Machine learning models in health care are often deployed in settings where it is important to protect patient privacy. In such settings, methods for differentially private (DP) learning provide a general-purpose approach to learn models with privacy guarantees. Modern methods for DP learning ensure privacy through mechanisms that censor information judged as too unique. The resulting privacy-preserving models, therefore, neglect information from the tails of a data distribution, resulting in a loss of accuracy that can disproportionately affect small groups. In this paper, we study the effects of DP learning in health care. We use state-of-the-art methods for DP learning to train privacy-preserving models in clinical prediction tasks, including x-ray classification of images and mortality prediction in time series data. We use these models to perform a comprehensive empirical investigation of the tradeoffs between privacy, utility, robustness to dataset shift, and fairness. Our results highlight lesser-known limitations of methods for DP learning in health care, models that exhibit steep tradeoffs between privacy and utility, and models whose predictions are disproportionately influenced by large demographic groups in the training data. We discuss the costs and benefits of differentially private learning in health care.




1. Introduction

The potential for machine learning to learn clinically relevant patterns in health care has been demonstrated across a wide variety of tasks (Tomašev et al., 2019; Gulshan et al., 2016; Wu et al., 2019; Rajkomar et al., 2018b). However, machine learning models are susceptible to privacy attacks (Shokri et al., 2017; Fredrikson et al., 2015) that allow malicious entities with access to these models to recover sensitive information, e.g., HIV status or zip code, of patients who were included in the training data. Others have shown that anonymized electronic health records (EHR) can be re-identified using simple “linkages” with public data (Sweeney, 2015), and that neural models trained on EHR are susceptible to membership inference attacks (Shokri et al., 2017; Jordon et al., 2020).

Differential privacy (DP) has been proposed as a leading technique to minimize re-identification risk through linkage attacks (Narayanan and Shmatikov, 2008; Dwork et al., 2017), and is being used to collect personal data by the 2020 US Census (Hawes, 2020), user statistics in iOS and macOS by Apple (Tang et al., 2017), and Chrome Browser data by Google (Nguyên et al., 2016). DP is an algorithm-level guarantee used in machine learning (Dwork et al., 2006), where an algorithm is said to be differentially private if its output is statistically indistinguishable when applied to two input datasets that differ by only one record. DP learning focuses with increasing intensity on learning the “body” of a targeted distribution as the desired level of privacy increases. Techniques such as differentially private stochastic gradient descent (DP-SGD) (Abadi et al., 2016) and objective perturbation (Chaudhuri et al., 2011; Neel et al., 2019) have been developed to efficiently train models with DP guarantees, but introduce a privacy-utility tradeoff (Geng et al., 2020). This tradeoff has been well-characterized in computer vision (Papernot et al., 2020) and tabular data (Shokri et al., 2017; Jayaraman and Evans, 2019), but has not yet been characterized in health care datasets. Further, asymptotic theoretical guarantees about the robustness of DP learning have been established (Nissim and Stemmer, 2015; Jung et al., 2019), but privacy-robustness tradeoffs have not been evaluated in health care settings. Finally, more “unique” minority data may not be well-characterized by DP, leading to a noted privacy-fairness tradeoff in vision (Bagdasaryan et al., 2019; Farrand et al., 2020) and natural language settings (Bagdasaryan et al., 2019).
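For reference, the indistinguishability property described above is usually formalized as (ε, δ)-differential privacy. The notation below (a randomized mechanism M, neighboring datasets D and D′ that differ in one record, and an output set S) is the standard textbook formulation and is an assumption here, not notation taken from this paper:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
\quad \text{for all neighboring } D, D' \text{ and all measurable output sets } S .
```

Smaller ε means the two output distributions are harder to tell apart, and δ is the probability mass on which the multiplicative bound is allowed to fail.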

To date there has not been a robust characterization of utility, privacy, robustness, and fairness tradeoffs for DP models in health care settings. Patient health and care are often highly individualized with a heavy “tail” due to the complexity of illness and treatment variation (Hripcsak et al., 2016), and any loss of model utility in a deployed model is likely to hinder delivered care (Topol, 2019). Privacy-robustness tradeoffs may also be high cost in health care, as data is highly volatile and variant, evolving quickly in response to new conditions (Cohen et al., 2020), clinical practice shifts (Herrera-Perez et al., 2019), and underlying EHR systems changing (Nestor et al., 2019). Privacy-fairness tradeoffs are perhaps the most pernicious concern in health care as there are well-documented prejudices in health care (Chen et al., 2020a). Importantly, the data of patients from minority groups also often lie even further in data tails because lack of access to care can impact patients’ EHR presence (Ferryman and Pitcan, 2018), and leads to small sample sizes of non-white patients (Chen et al., 2018).

In this work, we investigate the feasibility of using DP methods to train models for health care tasks. We characterize the impact of DP in both linear and neural models on 1) accuracy, 2) robustness, and 3) fairness. First, we establish the privacy-utility tradeoffs within two health care datasets (NIH Chest X-Ray data (Wang et al., 2017) and MIMIC-III EHR data (Johnson et al., 2016)) as compared to two vision datasets (MNIST (LeCun et al., 2010) and Fashion-MNIST (Xiao et al., 2017)). We find that DP models have severe privacy-utility tradeoffs in the MIMIC-III EHR setting, using three common tasks: mortality, long length of stay (LOS), and an intervention (vasopressor) onset (Harutyunyan et al., 2017; Wang et al., 2019). Second, we investigate the impact of DP on robustness to dataset shifts in EHR data. Because medical data often contains dataset shifts over time (Ghassemi et al., 2018), we create a realistic yearly model training scenario and evaluate the robustness of DP models under these shifts. Finally, we investigate the impact of DP on fairness in two ways: loss of performance and loss of influence. We define loss of performance through the standard, and often competing, group fairness metrics (Hardt et al., 2016; Kearns et al., 2018, 2019) of performance gap, parity gap, recall gap, and specificity gap. We examine fairness further by looking at loss of minority data importance with influence functions (Koh and Liang, 2017). In our audits, we focus on loss of population minority influence, e.g., importance of Black patient data, and label minority influence, e.g., importance of positive class patient data, across levels of privacy (low to high) and levels of dataset shift (least to most malignant).

Across our experiments we find that DP learning algorithms are not well-suited for off-the-shelf use in health care. First, DP models have significantly more severe privacy-utility tradeoffs in the MIMIC-III EHR setting, and this tradeoff is proportional to the size of the tails in the data. This tradeoff holds even in larger datasets such as NIH Chest X-Ray. Second, we find that DP learning does not increase model robustness in the presence of small or large dataset shifts, despite theoretical guarantees (Jung et al., 2019). Third, we do not find a significant drop in standard group fairness definitions, unlike other domains (Bagdasaryan et al., 2019), likely due to the dominating effect of utility loss. We do, however, find a large drop in minority class influence. Specifically, we show that Black training patients lose “helpful” influence on Black test patients. Finally, we outline a series of open problems that future work should address to make DP learning feasible in health care settings.

1.1. Contributions

In this work, we evaluate the impact of DP learning on linear and neural networks across three tradeoffs: privacy-utility, privacy-robustness and privacy-fairness. Our analysis contributes to a call for ensuring that privacy mechanisms equally protect all individuals (Ekstrand et al., 2018). We present the following contributions:

  • Privacy-utility tradeoffs scale sharply with tail length. We find that DP has particularly strong tradeoffs as tasks have fewer positive examples, resulting in unusable classifier performance. Further, increasing the dataset size does not improve utility tradeoffs in our health care tasks.

  • There is no correlation between privacy and robustness in EHR shifts. We show that DP generally does not improve shift robustness, with the mortality task as one exception. Despite this, we find no correlation between increasing privacy and improved shift robustness in our tasks, most likely due to the poor utility tradeoffs.

  • DP gives unfair influence to majority groups that is hard to detect with standard measures of group fairness. We show that increasing privacy does not result in disparate impact for minority groups across multiple protected attributes and standard group fairness definitions because the privacy-utility tradeoff is so extreme. We use influence functions to demonstrate that the inherent group privacy property of DP results in large losses of influence for minority groups across patient class label, and patient ethnicity labels.

| Group | Dataset | Data Type | Outcome Variable | Samples (dimensions) | Classification Task | Tail Size | Protected Attributes | Evaluation |
|---|---|---|---|---|---|---|---|---|
| Health care | mimic_mortality | Time Series | in-ICU mortality | 21,877 (24, 69) | Binary | Large | Ethnicity | U, R, F |
| Health care | mimic_los_3 | Time Series | length of stay > 3 days | 21,877 (24, 69) | Binary | Small | Ethnicity | U, R, F |
| Health care | mimic_intervention | Time Series | vasopressor administration | 21,877 (24, 69) | Multiclass (4) | Small | Ethnicity | U, R, F |
| Health care | NIH_chest_x_ray | Imaging | multilabel disease prediction | 112,120 (256, 256) | Multiclass multilabel (14) | Largest | Sex | U, F |
| Vision baselines | mnist | Imaging | number classification | 60,000 (28, 28) | Multiclass (10) | None | N/A | U |
| Vision baselines | fashion_mnist | Imaging | clothing classification | 60,000 (28, 28) | Multiclass (10) | None | N/A | U |

Table 1. We analyze tradeoffs in two vision baseline datasets and two health care datasets. We use three prediction tasks in MIMIC-III with different tail sizes and focus our utility (U), robustness (R), and fairness (F) analyses on these tasks. Finally, we choose NIH Chest X-Ray which is a larger dataset with the largest tail to examine whether increasing the dataset size has an impact on utility and fairness tradeoffs.

2. Related Work

2.1. Differential Privacy

DP provides much stronger privacy guarantees than methods such as k-anonymity (Sweeney, 2002) and t-closeness (Li et al., 2007) against a number of privacy attacks, such as reconstruction, tracing, linkage, and differential attacks (Dwork et al., 2017). The outputs of DP analyses are resistant to attacks based on auxiliary information, meaning they cannot be made less private (Dwork et al., 2014). Such benefits have made DP a leading method for ensuring privacy in consumer data settings (Hawes, 2020; Tang et al., 2017; Nguyên et al., 2016). Further, theoretical analyses have demonstrated improved generalization guarantees for out-of-distribution examples (Jung et al., 2019), but there has been no empirical analysis of DP model robustness, e.g., in the presence of dataset shift. Other theoretical analyses demonstrate that a model that is both private and approximately fair can exist in finite sample access settings. However, they show that it is impossible to achieve DP and exact fairness with non-trivial accuracy (Cummings et al., 2019). This is empirically shown in DP-SGD, which has disparate impact on complex minority groups in vision and NLP (Bagdasaryan et al., 2019; Farrand et al., 2020).

Differential Privacy in Health Care

Prior work on DP in machine learning for health care has focused on the distributed setting, where multiple hospitals collaborate to learn a model (Beaulieu-Jones et al., 2018; Pfohl et al., 2019). This work has shown that DP learning leads to a loss in model performance defined by area under the receiver operator characteristic (AUROC). We instead focus on analyzing the tradeoffs between privacy, robustness, and fairness, with an emphasis on the impact that DP has on subgroups.

2.2. Utility, Robustness, and Fairness in Health Care

Utility Needs in Health Care Tasks

Machine learning in health care is intended to support clinicians in their decision making, which suggests that models need to perform similarly to physicians (Davenport and Kalakota, 2019). The specific metric is dependent on the task, as high positive predictive value may be preferred over high negative predictive value (Kelly et al., 2019). In this work, we measure predictive accuracy with AUROC and AUPRC, characterizing the loss in these metrics as privacy levels increase.

Robustness to Dataset Shift

The effect of dataset shift has been studied in non-DP health care settings, demonstrating that model performance often deteriorates when the data distribution is non-stationary (Jung and Shah, 2015; Davis et al., 2017; Subbaswamy et al., 2018). Recent work has demonstrated that performance deteriorates rapidly on patient LOS and mortality prediction tasks in the MIMIC-III EHR dataset, when trained on past years, and applied to a future year (Nestor et al., 2019). We focus on this setting for a majority of our experiments, leveraging year-to-year changes in population as small dataset shifts, and a change in EHR software between 2008 and 2009 as a large dataset shift.

Group Fairness

Disparities exist between white and Black patients, resulting in health inequity in the U.S.A. (Orsi et al., 2010; Obermeyer et al., 2019). Further, even the use of some sensitive data like ethnicity in medical practice is contentious (Vyas et al., 2020), and has been called into question in risk scores, for instance in estimating kidney function (Martin, 2011; Eneanya et al., 2019).

Much work has described the ability of machine learning models to exacerbate disparities between protected groups (Chen et al., 2018); even state-of-the-art chest X-Ray classifiers demonstrate diagnostic disparities between sex, ethnicity, and insurance type (Seyyed-Kalantari et al., 2020). We leverage recent work in measuring the group fairness of machine learning models for different statistical definitions (Hardt et al., 2016) in supervised learning.

We complement these standard metrics by also examining loss of data importance through influence functions (Koh and Liang, 2017); influence functions have also been extended to approximate the effects of subgroups on a model’s prediction (Koh et al., 2019). Related work demonstrates that memorization is required for small generalization error on long-tailed distributions (Feldman, 2020).

3. Data

Details of each data source and prediction task are shown in Table 1. The four datasets are intentionally of different sizes, with respective tasks that represent distributions with and without long tails.

3.1. Vision Baselines

We use MNIST (LeCun et al., 2010) and FashionMNIST (Xiao et al., 2017) to demonstrate the benchmark privacy-utility tradeoffs in non-health settings with no tails. We use the NIH Chest X-Ray dataset (Wang et al., 2017) (112,120 images, details in Appendix B.2) to benchmark privacy-utility tradeoffs in a medically based, but still vision-focused, setting with the largest tails of all of our tasks.

3.2. MIMIC-III Time Series EHR Data

For the remainder of our analyses on privacy-robustness and privacy-fairness, we use the MIMIC-III database (Johnson et al., 2016), a publicly available anonymized EHR dataset of intensive care unit (ICU) patients (21,877 unique patient stays, details in Appendix B.1). We focus on two binary prediction tasks, (1) ICU mortality (class imbalanced) and (2) LOS greater than 3 days (class balanced), and one multiclass prediction task, (3) intervention onset for vasopressor administration (class balanced) (Harutyunyan et al., 2017; Wang et al., 2019).

Source of Distribution Shift

In MIMIC-III, there is a known source of dataset shift after 2008 due to a transition in the EHR used (66). There are also smaller shifts in non-transition years as the patient distribution is non-stationary (Nestor et al., 2019).

4. Methodology

We use both DP-SGD and objective perturbation across three different privacy levels to evaluate the impact that DP learning has on utility and robustness to dataset shift. Given the worse utility and robustness tradeoffs using objective perturbation, we focus our subsequent fairness analyses on DP-SGD in health care settings.

4.1. Model Classes

Vision Baselines

We use different convolutional neural network architectures for the MNIST and FashionMNIST prediction tasks based on prior work (Papernot et al., 2020). We use DenseNet-121 pretrained on ImageNet for the NIH Chest X-Ray experiments (Seyyed-Kalantari et al., 2020).


For the MIMIC-III health care task analyses, we choose one linear model and one neural network per task, based on the best baselines, trained without privacy, outlined in prior work creating benchmarks for the MIMIC-III dataset (Wang et al., 2019). For binary prediction tasks we use logistic regression (LR) (Cox, 1972) and gated recurrent unit with decay (GRU-D) (Che et al., 2018). For our multiclass prediction task, we use LR and 1D convolutional neural networks.

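Vision Baselines

| Dataset | Model | None | Low | High |
|---|---|---|---|---|
| MNIST | CNN | | | |
| FashionMNIST | CNN | | | |

MIMIC-III

| Task | Model | None | Low | High |
|---|---|---|---|---|
| Mortality | LR | | | |
| Mortality | GRUD | | | |
| Length of Stay > 3 | LR | | | |
| Length of Stay > 3 | GRUD | | | |
| Intervention Onset (Vaso) | LR | | | |
| Intervention Onset (Vaso) | CNN | | | |

NIH Chest X-Ray

| Metric | Model | None | Low | High |
|---|---|---|---|---|
| Average AUC | DenseNet-121 | | | |
| Best AUC | DenseNet-121 | (Hernia) | (Edema) | (Pleural Thickening) |
| Worst AUC | DenseNet-121 | (Infiltration) | (Fibrosis) | (Pleural Thickening) |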

Table 2. Health care tasks have a significant tradeoff between the High and Low or None setting. The tradeoff is better in tasks with small tails (length of stay and intervention onset), and worst in tasks such as mortality and NIH Chest X-Ray with long tails. We provide the (ε, δ) guarantees in parentheses, where ε represents the privacy loss (lower is better) and δ represents the probability that the guarantee does not hold (lower is better).

4.2. Differentially Private Training

We train models without privacy guarantees using stochastic gradient descent (SGD). DP models are trained with DP-SGD (Abadi et al., 2016), which is the de facto approach for both linear models and neural networks. We choose not to train models using PATE (Papernot et al., 2016), because it requires access to public data for semi-supervised learning and this is unrealistic in health care settings. In the Appendix, we provide results for models trained using objective perturbation (Chaudhuri et al., 2011; Neel et al., 2019), which also provides a DP guarantee but is only applicable to our linear models. We focus on DP-SGD due to its better theoretical guarantees (Jagielski et al., 2020) regarding privacy-utility tradeoffs, and objective perturbation’s limited applicability to linear models. The modifications made to SGD involve clipping gradients computed on a per-example basis to have a maximum norm, and then adding Gaussian noise to these gradients before applying parameter updates (Abadi et al., 2016) (Appendix E.1).
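The two modifications to SGD described above (per-example clipping, then Gaussian noise) can be sketched in a few lines of NumPy. This is a minimal illustration for logistic regression, not the training code used in the paper: the function name `dp_sgd_step`, the toy data, and the learning rate are assumptions made for the example, and the noise standard deviation is taken to be the noise multiplier times the clipping norm, as in the usual DP-SGD formulation.

```python
import numpy as np

def dp_sgd_step(w, X, y, lr, clip_norm, noise_multiplier, rng):
    """One DP-SGD update for logistic regression (illustrative sketch)."""
    # Per-example gradients of the logistic loss: (sigmoid(x.w) - y) * x
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    grads = (p - y)[:, None] * X                       # shape (batch, dim)
    # Clip each example's gradient so its L2 norm is at most clip_norm
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    grads = grads / np.maximum(1.0, norms / clip_norm)
    # Sum, add Gaussian noise scaled to the clipping norm, then average
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (grads.sum(axis=0) + noise) / len(X)
    return w - lr * noisy_mean

# Toy usage (hypothetical data): clip norm 1.0 and noise multiplier 1.0
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
for _ in range(100):
    w = dp_sgd_step(w, X, y, lr=0.5, clip_norm=1.0, noise_multiplier=1.0, rng=rng)
```

Because every example contributes a gradient of norm at most `clip_norm`, rare examples with unusually large gradients are scaled down the most, which is one way to see why information in the tails is the first to be lost.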

We choose three different levels of privacy to measure the effect of increasing levels of privacy by varying levels of epsilon. We selected these levels based on combinations of the noise level, clipping norm, number of samples, and number of epochs. Our three privacy levels are: None, Low (Clip Norm = 5.0, Noise Multiplier = 0.1), and High (Clip Norm = 1.0, Noise Multiplier = 1.0). We provide a detailed description of the training setup in terms of hyperparameters and infrastructure in the Appendix.


4.3. Privacy Metrics

We measure DP using the ε bound derived analytically using the Rényi DP accountant for DP-SGD. Larger values of ε reflect lower privacy. Note that the privacy guarantees reported for each model are underestimates because they do not include the privacy loss due to hyperparameter searches (Chaudhuri and Vinterbo, 2013; Liu and Talwar, 2019).
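To give a feel for how a Rényi DP accountant turns a noise multiplier into an ε, the sketch below composes the Gaussian mechanism over a number of steps and converts the result to (ε, δ)-DP. It is a simplified stand-in, not the accountant used in the paper: it ignores privacy amplification from minibatch subsampling, so its values are much looser than those a DP-SGD accountant would report. The function name and the grid of Rényi orders are assumptions.

```python
import numpy as np

def gaussian_rdp_epsilon(noise_multiplier, steps, delta, orders=None):
    """Upper bound on epsilon for `steps` compositions of the Gaussian mechanism.

    Ignores amplification by subsampling, so it is looser than the bound a
    DP-SGD accountant would give for minibatch training.
    """
    if orders is None:
        orders = np.arange(1.25, 256.0, 0.25)
    # Renyi DP of the Gaussian mechanism at order a (unit sensitivity):
    # a / (2 * sigma^2); composition over steps adds the costs.
    rdp = steps * orders / (2.0 * noise_multiplier ** 2)
    # Convert to (epsilon, delta)-DP and keep the tightest order.
    eps = rdp + np.log(1.0 / delta) / (orders - 1.0)
    return float(eps.min())

# More noise gives a smaller epsilon; more steps give a larger one.
eps_high_noise = gaussian_rdp_epsilon(noise_multiplier=1.0, steps=1, delta=1e-5)
eps_low_noise = gaussian_rdp_epsilon(noise_multiplier=0.1, steps=1, delta=1e-5)
```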

5. Privacy-Utility Tradeoffs

We analyze the privacy-utility tradeoff by training linear and neural models with DP learning. We analyze performance across three privacy levels for the vision, MIMIC-III and NIH Chest X-Ray datasets. The privacy-utility tradeoffs for these datasets and tasks have not been characterized yet. Our work provides a benchmark for future work on evaluating DP learning.

Experimental Setup

We train both linear and neural models on the tabular MIMIC-III tasks. We train deep neural networks on NIH Chest X-Ray image tasks and the vision baseline tasks. We first analyze the effect that increased tail length in MIMIC-III has on the privacy-utility tradeoff. Next, we compare whether linear or neural models have better privacy-utility tradeoffs. Finally, we use the NIH Chest X-Ray dataset to evaluate if increasing dataset size, while keeping similar tail sizes, results in better tradeoffs.

Time Series Utility Metrics

For MIMIC-III, we average the model AUROC across all shifted test sets to quantitatively measure the utility tradeoff. We measure the privacy-utility tradeoff based on the difference in performance metrics as the level of privacy increases. The average performance across years is used because it incorporates the performance variability between each of the years due to dataset shift. Results for AUPRC for MIMIC-III can be found in Appendix H.1. Both our AUROC and AUPRC results show extreme utility tradeoffs in health care tasks. Both metrics are commonly used to evaluate clinical performance of diagnostic tests (Hajian-Tilaki, 2013).
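The averaging described above can be sketched as follows. AUROC is computed here from the rank-sum (Mann-Whitney) statistic so the example needs only NumPy; the function names and the dictionary layout of per-year results are assumptions for illustration, and the unweighted mean across years is one plausible reading of "average".

```python
import numpy as np

def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) statistic, averaging tied ranks."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    # Give tied scores the mean of the ranks they occupy
    for s in np.unique(scores):
        tied = scores == s
        ranks[tied] = ranks[tied].mean()
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    return float((ranks[y_true].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg))

def mean_auroc_over_years(results_by_year):
    """results_by_year: {year: (y_true, scores)} -> unweighted mean AUROC."""
    return float(np.mean([auroc(y, s) for y, s in results_by_year.values()]))
```

The privacy-utility tradeoff for a task is then the difference between this average for the non-private model and for the private one.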

Imaging Utility Metrics

For the NIH Chest X-Ray experiments, the task we experiment on is multiclass multilabel disease prediction. We average the AUROC across all 14 disease labels. For the MNIST and FashionMNIST vision baselines, the task we experiment on is multiclass prediction (10 labels for both) where we evaluate using accuracy.

Figure 1. The effect of DP learning on robustness to non-stationarity and dataset shift. One instance of increased robustness in the 2009 column for mortality prediction in the high privacy setting (A), but this does not hold across all tasks and models. Performance drops in the 2009 column for LOS in both LR and GRU-D (B), and a much worse drop in the high privacy CNN for intervention prediction (C).

5.1. Health Care Tasks Have Steep Utility Tradeoffs

We compare the privacy-utility tradeoffs in Table 2. DP-SGD generally has a negative impact on model utility. The extreme tradeoffs in MIMIC-III mortality prediction and NIH Chest X-Ray diagnosis exemplify the information DP-SGD loses from the tails, because the positive cases are in the long tails of the distribution. There is a 22% and 26% drop in the AUROC between no privacy and high privacy settings for mortality prediction for LR and GRUD respectively. There is a 35% drop in AUROC between the no privacy and high privacy settings for the NIH Chest X-Ray prediction task, which has a much longer tail than mortality prediction. Our results for objective perturbation show worse utility tradeoffs than those presented by DP-SGD (Appendix G.1).

5.2. Linear Models Have Better Privacy-Utility Tradeoffs

Across all three prediction tasks in the MIMIC-III dataset we find that linear models have better tradeoffs in the presence of long tails. This is likely due to two issues: small generalization error in neural networks often requires memorization in long tail settings (Zhang et al., 2016; Feldman, 2020), and gradient clipping introduces more bias as the number of model parameters increases (Chen et al., 2020b; Song et al., 2020).

5.3. Larger Datasets Do Not Achieve Better Tradeoffs

Theoretical analyses show that privacy-utility tradeoff can be improved with larger datasets (Vadhan, 2017). We find that the NIH Chest X-Ray dataset also has extreme tradeoffs. Despite its larger size, the dataset’s positive labels in long tails are similarly lost.

6. Privacy-Robustness Tradeoffs

A potential motivation for using DP despite extreme utility tradeoffs is the recent theoretical robustness guarantees (Jung et al., 2019). We investigate the impact of DP on mitigating dataset shift for time series MIMIC-III tasks by analyzing model performance across years of care. We first record generalization as the difference between performance when a model is trained and tested on data drawn from the same distribution and performance on a shifted test set drawn from a different distribution. We then measure the malignancy of the yearly shifts using a domain classifier. Finally, we perform a Pearson’s correlation test (Stigler, 1989) between the model’s generalization capacity and the shift malignancy.

Experimental Setup

We analyze the robustness of DP models to dataset shift in the MIMIC-III health care tasks. We use year-to-year variation in hospital practices as small shifts, and a change in EHR software between 2008 and 2009 as a source of major dataset shift. We define robustness as the difference in test accuracy between in-distribution and out-of-distribution data. For instance, to measure model robustness from 2006 to 2007, we would 1) train a model on data from 2006, 2) test the model on data from 2006, and 3) test the same model on data from 2007. The difference in these two test accuracies is the 2006-2007 model robustness.
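The three-step procedure above, together with the correlation analysis it feeds, can be sketched as below. The helper names, the `predict` callable, and the tuple layout of the test sets are assumptions for illustration; the sketch computes only the Pearson coefficient and leaves out the significance test.

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def robustness_gap(predict, in_dist, out_dist):
    """In-distribution minus out-of-distribution test accuracy for one model.

    `predict` is a trained model's prediction function (e.g. trained on 2006);
    `in_dist` and `out_dist` are (X, y) test sets (e.g. 2006 and 2007).
    """
    (X_in, y_in), (X_out, y_out) = in_dist, out_dist
    return accuracy(y_in, predict(X_in)) - accuracy(y_out, predict(X_out))

def pearson_r(a, b):
    """Pearson correlation, e.g. between yearly gaps and shift malignancies."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    a, b = a - a.mean(), b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))
```

A gap near zero means the model performs as well on the later year as on its own year; a robust model would show gaps that do not grow with shift malignancy.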

Figure 2. DP bounds the individual influence of training patients on the loss of test patients (A) which improved robustness for mortality prediction between the least malignant shift in 2007 (B) and the most malignant in 2009 (C). Individual influence of training data in the no privacy setting on 100 test patients with highest influence variance. Each column on the x-axis is an individual test patient. A unique colour is plotted per column/test patient for ease of assessment. The influence value of each patient in the training set on a specific test point is plotted as a point in that patient’s column. Influence of training points is bounded in the high privacy setting (red dotted line).

Robustness Metrics

To measure the impact of DP-SGD on robustness to dataset shift, we measure the malignancy of yearly shifts from 2002 to 2012 for the MIMIC-III dataset. We then measure the correlation between malignancy of yearly shift and model performance. Following prior work, we use a binary domain classifier (the model class is chosen based on data type) trained to discriminate between in-domain and out-of-domain examples. The malignancy of the dataset shift is proportional to how difficult it is to train on the in-domain data and perform well on the out-of-domain data (Rabanser et al., 2019). Other methods such as multiple univariate hypothesis testing or multivariate hypothesis testing assume that the data is i.i.d. (Rabanser et al., 2019). A full procedure is given in Appendix A, with complete significance and malignancies for each year in Appendix D.
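As a rough illustration of the domain-classifier idea, the sketch below trains a logistic-regression classifier to tell two samples apart and reports its held-out accuracy. It covers only the shift-detection step and is not the full malignancy procedure of Appendix A; the function name, the half/half split, and the optimizer settings are all assumptions.

```python
import numpy as np

def domain_classifier_accuracy(X_src, X_tgt, rng, epochs=200, lr=0.1):
    """Held-out accuracy of a logistic-regression domain classifier.

    Accuracy near 0.5 means the two samples are hard to tell apart;
    accuracy near 1.0 indicates a large, easily detected shift.
    """
    X = np.vstack([X_src, X_tgt])
    d = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt))])
    # Shuffle, then use half for training and half for evaluation
    idx = rng.permutation(len(X))
    X, d = X[idx], d[idx]
    half = len(X) // 2
    mu, sd = X[:half].mean(axis=0), X[:half].std(axis=0) + 1e-8
    X_tr, X_te = (X[:half] - mu) / sd, (X[half:] - mu) / sd
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                       # full-batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X_tr @ w + b)))
        w -= lr * X_tr.T @ (p - d[:half]) / half
        b -= lr * np.mean(p - d[:half])
    pred = (X_te @ w + b) > 0
    return float(np.mean(pred == d[half:].astype(bool)))
```

In the yearly setting, `X_src` would hold features from the training year and `X_tgt` features from the test year, so a software change like the 2008-2009 EHR transition should be far easier to detect than ordinary year-to-year drift.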

6.1. DP Does Not Impart Robustness to Yearly EHR Data Shift

While we expect that DP will be more robust to dataset shift across all tasks and models, we find that model performance drops when the EHR shift occurs (2008-2009) across all privacy levels and tasks (Fig. 1). We note one exception: high privacy models are more stable in the mortality task during more malignant shifts (2007-2009) (Fig. 1), although we did not observe this improvement when training with objective perturbation (Appendix G.2). Despite this, we find that there are no significant correlations between model robustness and privacy level (Table 13).

Our analyses find that the robustness guarantees that DP provides do not hold in a large, tabular EHR setting. We note that the privacy-utility tradeoff from Section 5 is too extreme in health care to conclusively understand the effect on model robustness.

7. Privacy-Fairness Tradeoffs

Prior work has demonstrated that DP learning has disparate impact on complex minority groups in vision and NLP (Bagdasaryan et al., 2019). We expect similar disparate impacts on patient minority groups in the MIMIC-III and NIH Chest X-Ray datasets, based on known disparities in treatment and health care delivery (Orsi et al., 2010; Obermeyer et al., 2019). We evaluate disparities based on four standard group fairness definitions, and on loss of minority patient influence.

We focus on the disparities between white and Black patients in MIMIC-III, based on prior work showing classifier variation in the low number of Black patients (Chen et al., 2018). We focus on male and female patients in NIH Chest X-Ray based on prior work exposing disparities in chest x-ray classifier performance between these two groups (Seyyed-Kalantari et al., 2020).

| Privacy Level | Average Survived Influence | Average Died Influence | Most Helpful Group | Most Harmful Group |
|---|---|---|---|---|
| None | | | Died | Survived |
| Low | | | Survived | Survived |
| High | | | Survived | Survived |

Table 3. Group influence summary statistics of training data by class label in all privacy levels for all test patients. Privacy changes the most helpful group from the patients who died (minority) to the patients who survived (majority). DP learning minimizes the helpful influence of minority groups, resulting in worse utility.

Figure 3. Group influence of training data by class label in no privacy (A) and high privacy (B) settings on 100 test patients with highest influence variance. In the no privacy setting, patients who died have a helpful influence despite being a minority class. High privacy gives the majority group the most influence due to the group privacy guarantee.

White Test Patients

| Privacy Level | Average White Influence | Average Black Influence | Most Helpful Ethnicity | Most Harmful Ethnicity |
|---|---|---|---|---|
| None | | | White | White |
| Low | | | White | White |
| High | | | White | White |

Black Test Patients

| Privacy Level | Average White Influence | Average Black Influence | Most Helpful Ethnicity | Most Harmful Ethnicity |
|---|---|---|---|---|
| None | | | Black | White |
| Low | | | White | White |
| High | | | White | White |

Table 4. Group influence summary statistics across all privacy levels for white (majority) and Black (minority) training patients on both white and Black test patients in MIMIC-III. For Black test patients, privacy changes the most helpful group from Black patients to the majority white patients and minimizes the helpful influence of Black patients. This needs careful consideration as the use of ethnicity is still being investigated in medical practice.

Group Fairness Experimental Setup and Metrics

We measure fairness according to four standard group fairness definitions: performance gap, parity gap, recall gap, and specificity gap (Hardt et al., 2016). The performance gap for our health care tasks is the difference in AUROC between the selected subgroups. The remaining three definitions of fairness for binary prediction tasks are presented in Appendix A.3.
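The three gaps other than the performance gap can be illustrated for binary predictions as follows. The paper's exact definitions are in its Appendix A.3, which is not reproduced here, so this sketch assumes the common readings: parity gap as the difference in positive prediction rates, recall gap as the difference in true positive rates, and specificity gap as the difference in true negative rates. Reporting signed rather than absolute differences is also an assumption.

```python
import numpy as np

def group_rates(y_true, y_pred, mask):
    """Prediction rates for the subgroup selected by the boolean `mask`."""
    y, p = np.asarray(y_true)[mask], np.asarray(y_pred)[mask]
    return {
        "positive_rate": float(np.mean(p == 1)),        # used for the parity gap
        "recall": float(np.mean(p[y == 1] == 1)),       # true positive rate
        "specificity": float(np.mean(p[y == 0] == 0)),  # true negative rate
    }

def fairness_gaps(y_true, y_pred, group):
    """Signed gaps between group == 1 and group == 0."""
    group = np.asarray(group)
    a = group_rates(y_true, y_pred, group == 1)
    b = group_rates(y_true, y_pred, group == 0)
    return {
        "parity_gap": a["positive_rate"] - b["positive_rate"],
        "recall_gap": a["recall"] - b["recall"],
        "specificity_gap": a["specificity"] - b["specificity"],
    }
```

These metrics can pull in different directions: a model can have a zero parity gap while still missing far more true positives in one group.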

Influence Experimental Setup and Metrics

We use influence functions to measure the relative influence of training points on test set performance (equations in Appendix A.4). Influences above 0 are helpful in minimizing the test loss for the test patient in that column, and influences below 0 are harmful in minimizing the test loss for that patient. Our influence function method (Koh and Liang, 2017) assumes a smooth, convex loss function with respect to model parameters, and is therefore only valid for LR. We focus group privacy analyses on the LR model in no and high privacy settings for mortality prediction.

First, we aim to confirm that the gradient clipping in DP-SGD bounds the influence of all training patients on the loss of all test patients. For the utility tradeoff, we measure the group influence that the training patients of each label group has on the loss of each test patient. For the robustness tradeoff, we measure the individual influence of all training patients on the loss of test patients between the least malignant and most malignant dataset shifts. Finally, we measure the group influence of training patients in each ethnicity on the white test patients and Black test patients separately.
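For logistic regression, the individual and group influences described above can be sketched with a first-order influence-function approximation in the style of Koh and Liang (2017). The paper's exact equations are in its Appendix A.4, which is not reproduced here, so the details below are assumptions: the sign is chosen so that positive values are helpful, the L2 regularizer is treated as separate from the per-example loss, and group influence is taken as the sum of individual influences. All function names are hypothetical.

```python
import numpy as np

def per_example_grads(w, X, y):
    """Gradient of the (unregularized) logistic loss for each example."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (p - y)[:, None] * X

def hessian(w, X, l2):
    """Hessian of the mean logistic loss plus an L2 penalty."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return (X * (p * (1 - p))[:, None]).T @ X / len(X) + l2 * np.eye(X.shape[1])

def fit_logreg(X, y, l2=1e-2, iters=25):
    """L2-regularized logistic regression fitted with Newton's method."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = per_example_grads(w, X, y).mean(axis=0) + l2 * w
        w = w - np.linalg.solve(hessian(w, X, l2), grad)
    return w

def influence_matrix(w, X_train, y_train, X_test, y_test, l2=1e-2):
    """Entry [i, j] > 0: upweighting training point i lowers the loss on test j."""
    H = hessian(w, X_train, l2)
    h_inv_g_train = np.linalg.solve(H, per_example_grads(w, X_train, y_train).T)
    return (per_example_grads(w, X_test, y_test) @ h_inv_g_train).T

def group_influence(infl, train_group_mask):
    """Sum the influence of one training subgroup on each test point."""
    return infl[train_group_mask].sum(axis=0)
```

With a boolean mask over training patients (for example by label or by ethnicity), `group_influence` gives one value per test patient, which is the kind of quantity summarized in Tables 3 and 4.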

7.1. DP Has No Impact on Group Fairness on Average, But Reduces Variance Over Time

To measure the average fairness gap in MIMIC-III, we average group fairness measures across all years of care. In the NIH Chest X-ray data we average across all disease labels.

We find that DP-SGD has little impact on any of the tested fairness definitions in both MIMIC-III (Table 14) and NIH Chest X-Ray, likely due to the high privacy-utility tradeoff. DP-SGD does confer lower variance in fairness measures on MIMIC-III tasks over time (Appendix H.3.1).

7.2. DP Learning Gives Unfair Influence to Majority Groups

We find that DP-SGD reduces the influence of all training points on individual test points (Fig. 2) because gradient clipping tightly bounds the influence of all training points across test points.
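The clipping mechanism behind this bound can be sketched as follows. This is a minimal NumPy illustration of a single DP-SGD update in the style of Abadi et al. (2016), not the training code used in this work; the function names and the parameters `clip_norm`, `noise_multiplier`, and `lr` are hypothetical, and privacy accounting is omitted.

```python
import numpy as np

def clip_per_example(grads, clip_norm):
    """Rescale each per-example gradient so its L2 norm is at most clip_norm."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return grads * scale

def dp_sgd_step(w, grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD update from a batch of per-example gradients.

    grads has shape (batch_size, n_params). Every example is clipped with
    the same constant clip_norm, then Gaussian noise with standard
    deviation noise_multiplier * clip_norm is added to the summed gradient.
    """
    clipped = clip_per_example(grads, clip_norm)
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / len(grads)
    return w - lr * noisy_mean
```

Because every clipped gradient has norm at most `clip_norm`, no single patient can move the parameters by more than a fixed amount per step. Examples with large gradients, which tend to be rare or atypical ones, are the ones that get scaled down.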

Influence-Utility Tradeoff

We find the worst privacy-utility tradeoff in the mortality task. Non-DP models find the patients who died to be the most helpful in predictions of mortality (Fig. 3 and Table 3). However, because positive labels, i.e., death, are rare, DP models focus influence on patients who survived, resulting in unfair over-influence.

Influence-Robustness Tradeoff

We see improved robustness in LR for the mortality task, which has the most malignant dataset shift (2008-2009) (Figure 1).

We find that the variance of the influence is fairly low for non-DP models during lower malignancy shifts. In more malignant shifts, the variance of the influence is high, with many training points being harmful (Fig. 2). This is likely due to gradient clipping reducing influence variance, and it is entangled with the poor privacy-utility tradeoff (see H.2).

Influence-Fairness Tradeoff

We approximate the collective group influence of different ethnicities in the training set on the test loss in Fig. 10 and Table 4. We show that group privacy results in white patients having a more significant influence, both helpful and harmful, on test patients in the high privacy setting.
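One simple way to form such a group-level approximation is to sum individual influences over the members of each training group, which is the first-order approach studied by Koh et al. (2019). The sketch below assumes an influence matrix with training patients in rows and test patients in columns; the function name and the group labels are hypothetical, and this is not necessarily the exact aggregation behind Fig. 10 and Table 4.

```python
import numpy as np

def group_influence(influence, train_groups, test_groups):
    """Sum individual influences over each training group.

    influence[i, j] is the influence of training patient i on test
    patient j. Returns a dict keyed by (train_group, test_group) whose
    value is an array with one summed influence per test patient in
    that test group.
    """
    influence = np.asarray(influence, dtype=float)
    train_groups = np.asarray(train_groups)
    test_groups = np.asarray(test_groups)
    result = {}
    for g_train in np.unique(train_groups):
        summed = influence[train_groups == g_train].sum(axis=0)
        for g_test in np.unique(test_groups):
            result[(str(g_train), str(g_test))] = summed[test_groups == g_test]
    return result
```

Summary statistics such as the median or the most helpful group per test patient can then be computed from the returned arrays.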

8. Discussion

8.1. On Utility, Robustness and Trust in Clinical Prediction Tasks

Poor Utility Impacts Trust

While some reduced utility in long tail tasks is known (Feldman, 2020), the extreme tradeoffs that we observe in Table 2 are much worse than expected. Machine learning can only support the decision making processes of clinicians if there is clinical trust. If models do not perform as well as, or better than, clinicians once we include privacy, there is little reason to trust them (Topol, 2019).

Importance of Model Robustness

Despite the promising theoretical transferability guarantees of DP, the results in Fig 1 and Table H.2 demonstrate these do not transfer in our health care setting. While we explored changes in EHR software as dataset shift, there are many other known shifts in healthcare data, e.g., practice modifications due to reimbursement policy changes (Kocher et al., 2010), or changing clinical needs in public health emergencies such as COVID-19 (Cohen et al., 2020). If models do not maintain their utility after dataset shifts, catastrophic silent failures could occur in deployed models (Nestor et al., 2019).

8.2. Majority Group Influence Is Harmful in Health Care

We show in Figure 3 that the tails of the label distribution are minority-rich, which results in poor mortality prediction performance under DP. Prior work in evaluating model fairness in health care has focused on standard group fairness definitions (Pfohl et al., 2020). However, these definitions do not provide a detailed understanding of model fairness under reduced utility. Other work has shown that large utility loss can “wash out” fairness impacts (Farrand et al., 2020). Our work demonstrates that DP learning does harm group fairness in such “washed out” poor utility settings by giving majority groups (e.g., those that survived, and white patients) the most influence on predictions across all subgroups.

Why Influence Matters

Disproportionate assignment of influence is an important problem. Differences in access, practice, or recording reflect societal biases (Rajkomar et al., 2018a; Rose, 2018), and models trained on biased data may exhibit unfair performance in populations due to this underlying variation (Chen et al., 2019). Further, while patients with the same diagnosis are usually more helpful for estimating prognosis in practice (Croft et al., 2015), labels in health care often lack precision or, in some cases, may be unreliable (O’malley et al., 2005). In this setting, understanding which factors are consistently high-influence in patient phenotypes is an important task (Halpern et al., 2016; Yu et al., 2017).

Loss Of Black Influence

Ethnicity is currently used in medical practice as a factor in many risk scores, where different risk profiles are assumed for patients of different races (Martin, 2011). However, the validity of this stratification has recently been called into question by the medical community (Eneanya et al., 2019). Prior work has established the complexity of treatment variation in practice, as patient care plans are highly individualized, e.g., in a cohort of 250 million patients, 10% of diabetes and depression patients and almost 25% of hypertension patients had a unique treatment pathway (Hripcsak et al., 2016). Thus, having white patients become the most influential in Black patients’ predictions may not be desirable.

Anchoring Influence Loss in Systemic Injustice

Majority over-influence is prevalent in medical settings, and has a direct impact on the survival of patients. Many female and minority patients receive worse care and have worse outcomes because clinicians base their symptomatic evaluations on white and/or male patients (Greenwood et al., 2018, 2020). Further, randomized controlled trials (RCTs) are an important tool on which 10-20% of treatments are based (McGinnis et al., 2013). However, prior work has shown that RCTs have notoriously exclusive inclusion criteria; in one salient example, only 6% of asthmatic patients would have been eligible to enroll in the RCT that resulted in their treatments (Travers et al., 2007). RCTs tend to be composed of white, male patients, resulting in their data determining what is an effective treatment (Heiat et al., 2002). By removing influence from women, Hispanic patients, and Black patients, naive machine learning practices can exacerbate systemic injustices (Chen et al., 2020a).

There are ongoing efforts to improve representation of the population in RCTs, shifting away from the majority having majority influence on treatments (Stronks et al., 2013). Researchers using DP should follow suit, and work to reduce the disparate impact on influence to ensure that it does not perpetuate this existing bias in health care. One solution is to start measuring individual example privacy loss (Feldman and Zrnic, 2020) instead of a conservative worst-case bound across all patients. Currently, DP-SGD uses constant gradient clipping for all examples to ensure this constant worst-case bound. Instead, individual privacy accounting can help support adaptive gradient clipping for each example, which may help to reduce the disparate impact DP-SGD has on influence. We also encourage future privacy-fairness tradeoff analyses to include loss of influence as a standard metric, especially where the utility tradeoff is extreme.

8.3. Less Privacy In Tails Is Not an Option

The straightforward solution to the long tail issue is to “provide less or no privacy for the tails” (Kearns et al., 2015). This solution could amplify existing systemic biases against minority subgroups, and minority mistrust of medical institutions. For example, Black mothers in the US are most likely to be mistreated, dying in childbirth at a rate three times higher than white women (Berg et al., 1996). In this setting, it is not ethical to choose between a “non-private” prediction that will potentially leak unwanted information, e.g., prior history of abortion, and a “private” prediction that will deliver lower quality care.

8.4. On the Costs and Benefits of Privacy in Health Care

Privacy Issues With Health Care Data

Most countries have regulations that define the protective measures to maintain patient data privacy. In North America, these laws are defined by the Health Insurance Portability and Accountability Act (HIPAA) (Act, 1996) in the US and the Personal Information Protection and Electronic Documents Act (PIPEDA) (Act, 2000) in Canada. In the EU, these protections are governed by the General Data Protection Regulation (GDPR). Recent work has shown that HIPAA’s privacy regulations and standards, such as anonymizing data, are not sufficient to prevent advanced re-identification of data (Na et al., 2018). In one instance, researchers were able to re-identify individuals’ faces from MRIs using facial recognition software (Schwarz et al., 2019). Privacy attacks such as these illustrate why there is fear of health care data loss.

Who Are We Defending Against?

While there are potential concerns for data privacy, it is important to realize that privacy attacks assume a powerful entity with malicious purposes (Dwork et al., 2017). Patients are often not concerned when their doctors, or researchers, have access to medical data (Kalkman et al., 2019; Ghafur et al., 2020). However, there are concerns that private, for-profit corporations may purchase health care data that they can easily de-anonymize and link to information collected through their own products. Such linkages could result in raised insurance premiums (Beaulieu-Jones et al., 2018) and unwanted targeted advertising. Recently, Google and the University of Chicago Medicine faced a lawsuit from a patient due to his data being shared in a research partnership between the two organizations (Landi, 2020).

Setting a different standard for dataset release to for-profit entities could be one solution. This allows clinical entities and researchers to make use of full datasets without extreme tradeoffs, while addressing privacy concerns.

8.5. Open Problems for DP in Health Care

While health care has been cited as an important motivation for the development of DP (Papernot et al., 2020; Chaudhuri et al., 2011; Dwork et al., 2014, 2006; Wu et al., 2017; Vietri et al., 2020), our work demonstrates that it is not currently well-suited to these tasks. The theoretical assumptions of DP learning apply in extremely large collection settings, such as the successful deployment of DP for US Census data storage. We highlight potential areas of research that both the DP and machine learning communities should focus on to make DP usable for health care data:

  1. Adaptive and Personalized Privacy Accounting: Many of the individuals in the body of a distribution do not end up spending as much of the privacy budget as individuals in the tails. Current DP learning methods do not account for this and simply take a constant, conservative worst-case bound for everyone. Improved accounting that can give tails more influence through methods such as adaptive clipping can potentially improve the utility and fairness tradeoff.

  2. Auditing DP Learning in Health Care: Currently, ideal values for the privacy guarantee ε are below 100, but these are often unattainable when trying to maintain utility, as we demonstrate in our work. Empirically, DP-SGD provides much stronger guarantees against privacy attacks than those derived analytically (Jagielski et al., 2020). Developing a suite of attacks for health care settings that can provide similar empirical measurement would complement analytical guarantees. It would give decision makers more realistic information about what value of ε they actually need in health care.

9. Conclusion

In this work, we investigate the feasibility of using DP-SGD to train models for health care prediction tasks. We find that DP-SGD is not well-suited to health care prediction tasks in its current formulation. First, we demonstrate that DP-SGD loses important information about minority classes (e.g., dying patients, minority ethnicities) that lie in the tails of the data distribution. The theoretical robustness guarantees of DP-SGD do not apply to the dataset shifts we evaluated. We show that DP learning disparately impacts group fairness when looking at loss of influence for majority groups. We show this disparate impact occurs even when standard measures of group fairness show no disparate impact due to poor utility. This imposed asymmetric valuation of data by the model requires careful thought, because the appropriate use of class membership labels in medical settings is an active topic of discussion and debate. Finally, we propose open areas of research to improve the usability of DP in health care settings. Future work should target modifying DP-SGD, or creating novel DP learning algorithms, that can learn from data distribution tails effectively, without compromising privacy.


We would like to acknowledge the following funding sources: New Frontiers in Research Fund - NFRFE-2019-00844. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute. Thank you to the MIT Laboratory of Computational Physiology for facilitating year of care access to the MIMIC-III database. Finally, we would like to thank Nathan Ng, Taylor Killian, Victoria Cheng, Varun Chandrasekaran, Sindhu Gowda, Laleh Seyyed-Kalantari, Berk Ustun, Shalmali Joshi, Natalie Dullerud, Shrey Jain, and Sicong (Sheldon) Huang for their helpful feedback.


  • M. Abadi, A. Chu, I. Goodfellow, H. B. McMahan, I. Mironov, K. Talwar, and L. Zhang (2016) Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, New York, NY, USA, pp. 308–318. Cited by: §F.1, §1, §4.2.
  • A. Act (1996) Health insurance portability and accountability act of 1996. Public law 104, pp. 191. Cited by: §8.4.
  • P. Act (2000) Personal information protection and electronic documents act. Department of Justice, Canada. Full text available at http://laws.justice.gc.ca/en/P-8.6/text.html. Cited by: §8.4.
  • E. Bagdasaryan, O. Poursaeed, and V. Shmatikov (2019) Differential privacy has disparate impact on model accuracy. In Advances in Neural Information Processing Systems, pp. 15453–15462. Cited by: §1, §1, §2.1, §7.
  • B. K. Beaulieu-Jones, W. Yuan, S. G. Finlayson, and Z. S. Wu (2018) Privacy-preserving distributed deep learning for clinical data. arXiv preprint arXiv:1812.01484. Cited by: §2.1, §8.4.
  • C. J. Berg, H. K. Atrash, L. M. Koonin, and M. Tucker (1996) Pregnancy-related mortality in the united states, 1987–1990. Obstetrics & Gynecology 88 (2), pp. 161–167. Cited by: §8.3.
  • K. Chaudhuri, C. Monteleoni, and A. D. Sarwate (2011) Differentially private empirical risk minimization. Journal of Machine Learning Research 12 (Mar), pp. 1069–1109. Cited by: §1, §4.2, §8.5.
  • K. Chaudhuri and S. A. Vinterbo (2013) A stability-based validation procedure for differentially private machine learning. In Advances in Neural Information Processing Systems, pp. 2652–2660. Cited by: §4.3.
  • Z. Che, S. Purushotham, K. Cho, D. Sontag, and Y. Liu (2018) Recurrent Neural Networks for Multivariate Time Series with Missing Values. Scientific Reports 8 (1), pp. 6085 (en). ISSN 2045-2322. Cited by: §B.1.4, §4.1.
  • I. Chen, F. D. Johansson, and D. Sontag (2018) Why is my classifier discriminatory?. In Advances in Neural Information Processing Systems, pp. 3539–3550. Cited by: §1, §2.2, §7.
  • I. Y. Chen, E. Pierson, S. Rose, S. Joshi, K. Ferryman, and M. Ghassemi (2020a) Ethical machine learning in health. arXiv preprint arXiv:2009.10576. Cited by: §1, §8.2.
  • I. Y. Chen, P. Szolovits, and M. Ghassemi (2019) Can ai help reduce disparities in general medical and mental health care?. AMA Journal of Ethics 21 (2), pp. 167–179. Cited by: §8.2.
  • X. Chen, Z. S. Wu, and M. Hong (2020b) Understanding gradient clipping in private sgd: a geometric perspective. arXiv preprint arXiv:2006.15429. Cited by: §5.2.
  • J. P. Cohen, P. Morrison, L. Dao, K. Roth, T. Q. Duong, and M. Ghassemi (2020) Covid-19 image data collection: prospective predictions are the future. arXiv preprint arXiv:2006.11988. Cited by: §1, §8.1.
  • R. D. Cook and S. Weisberg (1980) Characterizations of an empirical influence function for detecting influential cases in regression. Technometrics 22 (4), pp. 495–508. Cited by: §A.4.
  • D. R. Cox (1972) Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological) 34 (2), pp. 187–202. Cited by: §4.1.
  • P. Croft, D. G. Altman, J. J. Deeks, K. M. Dunn, A. D. Hay, H. Hemingway, L. LeResche, G. Peat, P. Perel, S. E. Petersen, et al. (2015) The science of clinical practice: disease diagnosis or patient prognosis? evidence about “what is likely to happen” should shape clinical practice. BMC medicine 13 (1), pp. 20. Cited by: §8.2.
  • R. Cummings, V. Gupta, D. Kimpara, and J. Morgenstern (2019) On the compatibility of privacy and fairness. In Adjunct Publication of the 27th Conference on User Modeling, Adaptation and Personalization, pp. 309–315. Cited by: §2.1.
  • T. Davenport and R. Kalakota (2019) The potential for artificial intelligence in healthcare. Future healthcare journal 6 (2), pp. 94. Cited by: §2.2.
  • S. E. Davis, T. A. Lasko, G. Chen, E. D. Siew, and M. E. Matheny (2017) Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association 24 (6), pp. 1052–1061. Cited by: §2.2.
  • C. Dwork, F. McSherry, K. Nissim, and A. Smith (2006) Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pp. 265–284. Cited by: §1, §8.5.
  • C. Dwork, A. Roth, et al. (2014) The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9 (3-4), pp. 211–407. Cited by: §2.1, §8.5.
  • C. Dwork, A. Smith, T. Steinke, and J. Ullman (2017) Exposed! a survey of attacks on private data. Cited by: §1, §2.1, §8.4.
  • M. D. Ekstrand, R. Joshaghani, and H. Mehrpouyan (2018) Privacy for all: ensuring fair and equitable privacy protections. In Conference on Fairness, Accountability and Transparency, pp. 35–47. Cited by: §1.1.
  • N. D. Eneanya, W. Yang, and P. P. Reese (2019) Reconsidering the consequences of using race to estimate kidney function. Jama 322 (2), pp. 113–114. Cited by: §2.2, §8.2.
  • T. Farrand, F. Mireshghallah, S. Singh, and A. Trask (2020) Neither private nor fair: impact of data imbalance on utility and fairness in differential privacy. arXiv preprint arXiv:2009.06389. Cited by: §1, §2.1, §8.2.
  • V. Feldman and T. Zrnic (2020) Individual privacy accounting via a renyi filter. arXiv preprint arXiv:2008.11193. Cited by: §8.2.
  • V. Feldman (2020) Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pp. 954–959. Cited by: §2.2, §5.2, §8.1.
  • K. Ferryman and M. Pitcan (2018) Fairness in precision medicine. Data & Society. Cited by: §1.
  • M. Fredrikson, S. Jha, and T. Ristenpart (2015) Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pp. 1322–1333. Cited by: §1.
  • Q. Geng, W. Ding, R. Guo, and S. Kumar (2020) Tight analysis of privacy and utility tradeoff in approximate differential privacy. In International Conference on Artificial Intelligence and Statistics, pp. 89–99. Cited by: §1.
  • S. Ghafur, J. Van Dael, M. Leis, A. Darzi, and A. Sheikh (2020) Public perceptions on data sharing: key insights from the uk and the usa. The Lancet Digital Health 2 (9), pp. e444–e446. Cited by: §8.4.
  • M. Ghassemi, T. Naumann, P. Schulam, A. L. Beam, and R. Ranganath (2018) Opportunities in machine learning for healthcare. arXiv preprint arXiv:1806.00388. Cited by: §1.
  • B. N. Greenwood, S. Carnahan, and L. Huang (2018) Patient–physician gender concordance and increased mortality among female heart attack patients. Proceedings of the National Academy of Sciences 115 (34), pp. 8569–8574. Cited by: §8.2.
  • B. N. Greenwood, R. R. Hardeman, L. Huang, and A. Sojourner (2020) Physician–patient racial concordance and disparities in birthing mortality for newborns. Proceedings of the National Academy of Sciences 117 (35), pp. 21194–21200. Cited by: §8.2.
  • V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, R. Kim, R. Raman, P. C. Nelson, J. L. Mega, and D. R. Webster (2016) Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316 (22), pp. 2402–2410 (en). Cited by: §1.
  • K. Hajian-Tilaki (2013) Receiver operating characteristic (roc) curve analysis for medical diagnostic test evaluation. Caspian journal of internal medicine 4 (2), pp. 627. Cited by: §5.
  • Y. Halpern, S. Horng, Y. Choi, and D. Sontag (2016) Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association 23 (4), pp. 731–740. Cited by: §8.2.
  • Han-JD (2019) Gated recurrent unit with a decay mechanism for multivariate time series with missing values. Cited by: §F.2.
  • M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §1, §2.2, §7.
  • H. Harutyunyan, H. Khachatrian, D. C. Kale, G. V. Steeg, and A. Galstyan (2017) Multitask Learning and Benchmarking with Clinical Time Series Data. arXiv:1703.07771 [cs, stat]. Cited by: §1, §3.2.
  • M. B. Hawes (2020) Implementing differential privacy: seven lessons from the 2020 united states census. Harvard Data Science Review. Cited by: §1, §2.1.
  • A. Heiat, C. P. Gross, and H. M. Krumholz (2002) Representation of the elderly, women, and minorities in heart failure clinical trials. Archives of internal medicine 162 (15). Cited by: §8.2.
  • D. Herrera-Perez, A. Haslam, T. Crain, J. Gill, C. Livingston, V. Kaestner, M. Hayes, D. Morgan, A. S. Cifu, and V. Prasad (2019) Meta-research: a comprehensive review of randomized clinical trials in three medical journals reveals 396 medical reversals. Elife 8, pp. e45183. Cited by: §1.
  • G. Hripcsak, P.B. Ryan, J.D. Duke, N.H. Shah, R.W. Park, V. Huser, M.A. Suchard, M.J. Schuemie, F.J. DeFalco, A. Perotte, et al. (2016) Characterizing treatment pathways at scale using the ohdsi network. Proceedings of the National Academy of Sciences 113 (27), pp. 7329–7336. Cited by: §1, §8.2.
  • M. Jagielski, J. Ullman, and A. Oprea (2020) Auditing differentially private machine learning: how private is private sgd?. arXiv preprint arXiv:2006.07709. Cited by: §4.2, item 2.
  • B. Jayaraman and D. Evans (2019) Evaluating differentially private machine learning in practice. In 28th USENIX Security Symposium (USENIX Security 19), pp. 1895–1912. Cited by: §1.
  • A. E. W. Johnson, T. J. Pollard, L. Shen, L. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark (2016) MIMIC-III, a freely accessible critical care database. Scientific Data 3 (1), pp. 1–9 (en). ISSN 2052-4463. Cited by: §B.1.1, §1, §3.2.
  • J. Jordon, D. Jarrett, J. Yoon, T. Barnes, P. Elbers, P. Thoral, A. Ercole, C. Zhang, D. Belgrave, and M. van der Schaar (2020) Hide-and-seek privacy challenge. arXiv preprint arXiv:2007.12087. Cited by: §1.
  • C. Jung, K. Ligett, S. Neel, A. Roth, S. Sharifi-Malvajerdi, and M. Shenfeld (2019) A new analysis of differential privacy’s generalization guarantees. arXiv preprint arXiv:1909.03577. Cited by: §1, §1, §2.1, §6.
  • K. Jung and N. H. Shah (2015) Implications of non-stationarity on predictive modeling using ehrs. Journal of biomedical informatics 58, pp. 168–174. Cited by: §2.2.
  • S. Kalkman, J. van Delden, A. Banerjee, B. Tyl, M. Mostert, and G. van Thiel (2019) Patients’ and public views and attitudes towards the sharing of health data for research: a narrative review of the empirical evidence. Journal of Medical Ethics. ISSN 0306-6800. Cited by: §8.4.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2018) Preventing fairness gerrymandering: auditing and learning for subgroup fairness. In International Conference on Machine Learning, pp. 2564–2572. Cited by: §1.
  • M. Kearns, S. Neel, A. Roth, and Z. S. Wu (2019) An empirical study of rich subgroup fairness for machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 100–109. Cited by: §1.
  • M. Kearns, A. Roth, Z. S. Wu, and G. Yaroslavtsev (2015) Privacy for the protected (only). arXiv preprint arXiv:1506.00242. Cited by: §8.3.
  • C. J. Kelly, A. Karthikesalingam, M. Suleyman, G. Corrado, and D. King (2019) Key challenges for delivering clinical impact with artificial intelligence. BMC medicine 17 (1), pp. 195. Cited by: §2.2.
  • R. Kocher, E. J. Emanuel, and N. M. DeParle (2010) The affordable care act and the future of clinical medicine: the opportunities and challenges. Annals of internal medicine 153 (8), pp. 536–539. Cited by: §8.1.
  • P. W. Koh and P. Liang (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1885–1894. Cited by: §A.4, §1, §2.2, §7.
  • P. W. W. Koh, K. Ang, H. Teo, and P. S. Liang (2019) On the accuracy of influence functions for measuring group effects. In Advances in Neural Information Processing Systems, pp. 5255–5265. Cited by: §A.4, §2.2.
  • H. Landi (2020) Judge dismisses data sharing lawsuit against university of chicago, google. Cited by: §8.4.
  • Y. LeCun, C. Cortes, and C. Burges (2010) MNIST handwritten digit database. Cited by: §1, §3.1.
  • N. Li, T. Li, and S. Venkatasubramanian (2007) T-closeness: privacy beyond k-anonymity and l-diversity. In 2007 IEEE 23rd International Conference on Data Engineering, pp. 106–115. Cited by: §2.1.
  • J. Liu and K. Talwar (2019) Private selection from private candidates. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 298–309. Cited by: §4.3.
  • T. Martin (2011) The color of kidneys. American Journal of Kidney Diseases 58 (5), pp. A27–A28. Cited by: §2.2, §8.2.
  • J. M. McGinnis, L. Stuckhardt, R. Saunders, M. Smith, et al. (2013) Best care at lower cost: the path to continuously learning health care in america. National Academies Press. Cited by: §8.2.
  • [66] MIMIC. Accessed: 2020-09-30. Cited by: §3.2.
  • L. Na, C. Yang, C. Lo, F. Zhao, Y. Fukuoka, and A. Aswani (2018) Feasibility of reidentifying individuals in large national physical activity data sets from which protected health information has been removed with use of machine learning. JAMA network open 1 (8), pp. e186040–e186040. Cited by: §8.4.
  • A. Narayanan and V. Shmatikov (2008) Robust de-anonymization of large sparse datasets [netflix]. In IEEE Symposium on Research in Security and Privacy, Oakland, CA. Cited by: §1.
  • S. Neel, A. Roth, G. Vietri, and Z. S. Wu (2019) Differentially private objective perturbation: beyond smoothness and convexity. arXiv preprint arXiv:1909.01783. Cited by: §1, §4.2.
  • B. Nestor, M. B. A. McDermott, W. Boag, G. Berner, T. Naumann, M. C. Hughes, A. Goldenberg, and M. Ghassemi (2019) Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks. Cited by: §B.1.5, §1, §2.2, §3.2, §8.1.
  • T. T. Nguyên, X. Xiao, Y. Yang, S. C. Hui, H. Shin, and J. Shin (2016) Collecting and analyzing data from smart device users with local differential privacy. arXiv preprint arXiv:1606.05053. Cited by: §1, §2.1.
  • K. Nissim and U. Stemmer (2015) On the generalization properties of differential privacy. CoRR, abs/1504.05800. Cited by: §1.
  • K.J. O’malley, K.F. Cook, M.D. Price, K.R. Wildes, J.F. Hurdle, and C.M. Ashton (2005) Measuring diagnoses: ICD code accuracy. Health Services Research 40 (5p2), pp. 1620–1639. Cited by: §8.2.
  • Z. Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan (2019) Dissecting racial bias in an algorithm used to manage the health of populations. Science 366 (6464), pp. 447–453. Cited by: §2.2, §7.
  • J. M. Orsi, H. Margellos-Anast, and S. Whitman (2010) Black–white health disparities in the united states and chicago: a 15-year progress analysis. American journal of public health 100 (2), pp. 349–356. Cited by: §2.2, §7.
  • N. Papernot, M. Abadi, Ú. Erlingsson, I. Goodfellow, and K. Talwar (2016) Semi-supervised knowledge transfer for deep learning from private training data. arXiv:1610.05755. Cited by: §4.2.
  • N. Papernot, S. Chien, S. Song, A. Thakurta, and U. Erlingsson (2020) Making the shoe fit: architectures, initializations, and tuning for learning with privacy. Cited by: §1, §4.1, §8.5.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. ISSN 1533-7928. Cited by: §F.1.
  • S. R. Pfohl, A. M. Dai, and K. Heller (2019) Federated and differentially private learning for electronic health records. arXiv preprint arXiv:1911.05861. Cited by: §2.1.
  • S. R. Pfohl, A. Foryciarz, and N. H. Shah (2020) An empirical characterization of fair machine learning for clinical risk prediction. arXiv preprint arXiv:2007.10306. Cited by: §8.2.
  • S. Rabanser, S. Günnemann, and Z. Lipton (2019) Failing loudly: an empirical study of methods for detecting dataset shift. In Advances in Neural Information Processing Systems, pp. 1394–1406. Cited by: §6.
  • A. Rajkomar, M. Hardt, M. D. Howell, G. Corrado, and M. H. Chin (2018a) Ensuring fairness in machine learning to advance health equity. Annals of internal medicine 169 (12), pp. 866–872. Cited by: §8.2.
  • A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun, et al. (2018b) Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine 1 (1), pp. 18. Cited by: §1.
  • S. Rose (2018) Machine learning for prediction in electronic health data. JAMA network open 1 (4), pp. e181404–e181404. Cited by: §8.2.
  • C. G. Schwarz, W. K. Kremers, T. M. Therneau, R. R. Sharp, J. L. Gunter, P. Vemuri, A. Arani, A. J. Spychalla, K. Kantarci, D. S. Knopman, et al. (2019) Identification of anonymous mri research participants with face-recognition software. New England Journal of Medicine 381 (17), pp. 1684–1686. Cited by: §8.4.
  • L. Seyyed-Kalantari, G. Liu, M. McDermott, and M. Ghassemi (2020) CheXclusion: fairness gaps in deep chest x-ray classifiers. arXiv preprint arXiv:2003.00827. Cited by: §F.4, §2.2, §4.1, §7.
  • R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §1, §1.
  • S. Song, O. Thakkar, and A. Thakurta (2020) Characterizing private clipped gradient descent on convex generalized linear problems. arXiv preprint arXiv:2006.06783. Cited by: §5.2.
  • S. M. Stigler (1989) Francis galton’s account of the invention of correlation. Statistical Science, pp. 73–79. Cited by: §6.
  • K. Stronks, N. F. Wieringa, and A. Hardon (2013) Confronting diversity in the production of clinical evidence goes beyond merely including under-represented groups in clinical trials. Trials 14 (1), pp. 1–6. Cited by: §8.2.
  • A. Subbaswamy, P. Schulam, and S. Saria (2018) Preventing failures due to dataset shift: learning predictive models that transport. arXiv preprint arXiv:1812.04597. Cited by: §2.2.
  • L. Sweeney (2002) K-anonymity: a model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (05), pp. 557–570. Cited by: §2.1.
  • L. Sweeney (2015) Only you, your doctor, and many others may know. Technology Science 2015092903 (9), pp. 29. Cited by: §1.
  • J. Tang, A. Korolova, X. Bai, X. Wang, and X. Wang (2017) Privacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint arXiv:1709.02753. Cited by: §1, §2.1.
  • N. Tomašev, X. Glorot, J. W. Rae, M. Zielinski, H. Askham, A. Saraiva, A. Mottram, C. Meyer, S. Ravuri, I. Protsyuk, A. Connell, C. O. Hughes, A. Karthikesalingam, J. Cornebise, H. Montgomery, G. Rees, C. Laing, C. R. Baker, K. Peterson, R. Reeves, D. Hassabis, D. King, M. Suleyman, T. Back, C. Nielson, J. R. Ledsam, and S. Mohamed (2019) A clinically applicable approach to continuous prediction of future acute kidney injury. Nature 572 (7767), pp. 116–119. Cited by: §1.
  • E. J. Topol (2019) High-performance medicine: the convergence of human and artificial intelligence. Nature medicine 25 (1), pp. 44–56. Cited by: §1, §8.1.
  • J. Travers, S. Marsh, M. Williams, M. Weatherall, B. Caldwell, P. Shirtcliffe, S. Aldington, and R. Beasley (2007) External validity of randomised controlled trials in asthma: to whom do the results of the trials apply?. Thorax 62 (3), pp. 219–223. Cited by: §8.2.
  • S. Vadhan (2017) The complexity of differential privacy. In Tutorials on the Foundations of Cryptography, pp. 347–450. Cited by: §5.3.
  • G. Vietri, B. Balle, A. Krishnamurthy, and Z. S. Wu (2020) Private reinforcement learning with PAC and regret guarantees. arXiv preprint arXiv:2009.09052. Cited by: §8.5.
  • D. A. Vyas, L. G. Eisenstein, and D. S. Jones (2020) Hidden in plain sight—reconsidering the use of race correction in clinical algorithms. Mass Medical Soc. Cited by: §2.2.
  • S. Wang, M. B. A. McDermott, G. Chauhan, M. C. Hughes, T. Naumann, and M. Ghassemi (2019) MIMIC-Extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III. arXiv preprint arXiv:1907.08322. Cited by: §B.1.1, §1, §3.2, §4.1.
  • X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers (2017) Chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2097–2106. Cited by: §1, §3.1.
  • D. Wu, H. Kobayashi, C. Ding, L. Cheng, and K. G. M. Ghassemi (2019) Modeling the biological pathology continuum with HSIC-regularized Wasserstein auto-encoders. arXiv preprint arXiv:1901.06618. Cited by: §1.
  • X. Wu, F. Li, A. Kumar, K. Chaudhuri, S. Jha, and J. Naughton (2017) Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1307–1322. Cited by: §8.5.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §1, §3.1.
  • S. Yu, Y. Ma, J. Gronsbell, T. Cai, A.N. Ananthakrishnan, V.S. Gainer, S.E. Churchill, P. Szolovits, S.N. Murphy, I.S. Kohane, et al. (2017) Enabling phenotypic big data with phenorm. Journal of the American Medical Informatics Association 25 (1), pp. 54–60. Cited by: §8.2.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §5.2.

10. Appendices

Appendix A Background

a.1. Differential Privacy

Formally, a learning algorithm $L$ that trains models from a dataset satisfies $(\epsilon, \delta)$-DP if the following holds for all training datasets $d$ and $d'$ with a Hamming distance of 1 and for all sets of outputs $S$:

$$\Pr[L(d) \in S] \le e^{\epsilon} \Pr[L(d') \in S] + \delta$$

The parameter $\epsilon$ measures the formal privacy guarantee by defining an upper bound on the privacy loss in the worst possible case. A smaller $\epsilon$ represents stronger privacy guarantees. The factor $\delta$ allows for some probability that the property may not hold. For the privacy guarantees of a subgroup of size $k$, where $d$ and $d'$ differ in at most $k$ records, differential privacy defines the guarantee as:

$$\Pr[L(d) \in S] \le e^{k\epsilon} \Pr[L(d') \in S] + k e^{(k-1)\epsilon} \delta$$

This group privacy guarantee means that the privacy parameter $\epsilon$ degrades linearly with the size of the group.
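
The degradation with group size can be made concrete with a small calculation. The sketch below assumes the standard group-privacy bound for $(\epsilon, \delta)$-DP, in which a group of size $k$ receives a $(k\epsilon, k e^{(k-1)\epsilon}\delta)$ guarantee; the function name `group_privacy` is hypothetical and is not part of our experimental code.

```python
import math
from typing import Tuple


def group_privacy(epsilon: float, delta: float, k: int) -> Tuple[float, float]:
    """Return the (epsilon, delta) guarantee that applies to a group of size k.

    Assumes the standard group-privacy bound: epsilon grows linearly in k,
    while delta grows by a factor of k * exp((k - 1) * epsilon).
    """
    if k < 1:
        raise ValueError("group size k must be at least 1")
    group_epsilon = k * epsilon
    group_delta = k * math.exp((k - 1) * epsilon) * delta
    return group_epsilon, group_delta


# Illustrative values only: a group of four patients under epsilon = 0.5.
print(group_privacy(0.5, 1e-5, 4))
```

For a single patient (k = 1) the function returns the original guarantee unchanged.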

a.2. Measuring Dataset Shift

First, we create a balanced dataset made up of samples from both the training and test distributions. We then train a domain classifier on this dataset: a binary classifier that distinguishes samples in the training distribution from samples in the test distribution. Using a binomial test, we measure whether the accuracy of the domain classifier is significantly better than random chance (0.5), which means that the dataset shift is significant. We set up the binomial test as:

$$H_0: \text{acc} = 0.5 \quad \text{vs.} \quad H_A: \text{acc} > 0.5$$

Under the null hypothesis, the accuracy of the classifier follows a binomial distribution:

$$\text{acc} \sim \text{Bin}(N_{\text{test}}, 0.5)$$

where $N_{\text{test}}$ is the number of samples in the test set.

Assuming that the p-value is less than 0.05 for our hypothesis test on the accuracy of our domain classifier being better than random chance, we diagnose the malignancy of the shift. We train a binary prediction classifier without privacy on the original training set of patients. Then, we select the top 100 samples that the domain classifier most confidently predicted as being in the test distribution. Finally, we determine the malignancy of the shift by evaluating the performance of our prediction classifiers on the selected top 100 samples. Low accuracy means that the shift is malignant.
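
The two steps of this procedure can be sketched in a few lines. The sketch assumes a one-sided binomial test against chance accuracy and uses hypothetical helper names (`binomial_p_value`, `shift_is_significant`, `malignancy`); it evaluates the binomial tail exactly, which is only practical for test sets of up to roughly a thousand samples.

```python
from math import comb


def binomial_p_value(num_correct, num_samples, p=0.5):
    """One-sided p-value P(X >= num_correct) for X ~ Bin(num_samples, p)."""
    return sum(
        comb(num_samples, k) * p ** k * (1 - p) ** (num_samples - k)
        for k in range(num_correct, num_samples + 1)
    )


def shift_is_significant(num_correct, num_samples, alpha=0.05):
    """True if the domain classifier beats random chance at level alpha."""
    return binomial_p_value(num_correct, num_samples) < alpha


def malignancy(domain_scores, predictions, labels, top_k=100):
    """Accuracy of the prediction classifier on the top_k samples that the
    domain classifier most confidently assigns to the test distribution.
    Low accuracy indicates a malignant shift."""
    ranked = sorted(
        range(len(domain_scores)), key=lambda i: domain_scores[i], reverse=True
    )[:top_k]
    correct = sum(predictions[i] == labels[i] for i in ranked)
    return correct / len(ranked)


# Illustrative call: 60 correct domain predictions out of 100 test samples.
print(shift_is_significant(60, 100))
```

Here `domain_scores` stands for the domain classifier's predicted probability that each sample comes from the test distribution, which is an assumption about how confidence is measured.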

a.3. Fairness Definitions

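| Fairness Metric | Definition | Gap Name | Gap Equation |
|---|---|---|---|
| Demographic parity | $P(\hat{Y} = 1 \mid G = a) = P(\hat{Y} = 1 \mid G = b)$ | Parity Gap | $\frac{TP_a + FP_a}{N_a} - \frac{TP_b + FP_b}{N_b}$ |
| Equality of opportunity (positive class) | $P(\hat{Y} = 1 \mid Y = 1, G = a) = P(\hat{Y} = 1 \mid Y = 1, G = b)$ | Recall Gap | $\frac{TP_a}{TP_a + FN_a} - \frac{TP_b}{TP_b + FN_b}$ |
| Equality of opportunity (negative class) | $P(\hat{Y} = 0 \mid Y = 0, G = a) = P(\hat{Y} = 0 \mid Y = 0, G = b)$ | Specificity Gap | $\frac{TN_a}{TN_a + FP_a} - \frac{TN_b}{TN_b + FP_b}$ |

Table 5. The three equalized odds definitions of fairness that we use in the binary prediction tasks. The recall gap is most relevant in health care where we aim to minimize false negatives.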
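
As a rough sketch of how such gaps could be computed from binary predictions, the snippet below assumes that each gap is the difference between the corresponding rate in group `group_a` and group `group_b`: the positive prediction rate for the parity gap, the true positive rate for the recall gap, and the true negative rate for the specificity gap. The function names and the sign convention are assumptions made for illustration.

```python
def group_rates(y_true, y_pred):
    """Positive prediction rate, recall, and specificity for one group."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "parity": (tp + fp) / len(pairs),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def fairness_gaps(y_true, y_pred, groups, group_a, group_b):
    """Gap = rate(group_a) - rate(group_b) for each of the three metrics."""
    def subset(name):
        idx = [i for i, g in enumerate(groups) if g == name]
        return [y_true[i] for i in idx], [y_pred[i] for i in idx]

    rates_a = group_rates(*subset(group_a))
    rates_b = group_rates(*subset(group_b))
    return {metric: rates_a[metric] - rates_b[metric] for metric in rates_a}


# Hypothetical toy labels and predictions for two groups of four patients.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(fairness_gaps(y_true, y_pred, groups, "a", "b"))
```

A gap of zero means both groups receive the same rate; the sketch does not guard against groups that contain no positive or no negative labels.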

a.4. Influence Functions

Using the approach from (Koh and Liang, 2017), we analyze the influence of all training points on the loss for each test point, as defined in Equation 4. The approach formalizes the goal of understanding how the model’s predictions would change if we removed a training point. This connects naturally to differential privacy, which confers algorithmic stability by bounding the influence that any one training point has on the output distribution. First, the approach uses influence functions to approximate the change in model parameters by computing the parameter change if a training point $z$ was upweighted by some small $\epsilon$. The new parameters of the model are defined by $\hat{\theta}_{\epsilon, z} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} L(z_i, \theta) + \epsilon L(z, \theta)$. Next, applying the results from (Cook and Weisberg, 1980) and the chain rule, the authors arrive at Equation 4, which characterizes the influence that a training point $z$ has on the loss at a test point $z_{\text{test}}$:

$$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_{\theta} L(z_{\text{test}}, \hat{\theta})^{\top} H_{\hat{\theta}}^{-1} \nabla_{\theta} L(z, \hat{\theta}) \tag{4}$$

where $H_{\hat{\theta}}$ is the Hessian of the mean training loss at $\hat{\theta}$. Influence functions use an additive property for interpreting the influence of subgroups: the group influence is the sum of the influences of all individual points in the subgroup, although this is usually an underestimate of the true influence of removing the subgroup (Koh et al., 2019).
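
To illustrate the idea on a problem small enough to check by hand, the sketch below applies the influence formula of Koh and Liang (2017), the negative product of the test gradient, the inverse Hessian, and the training gradient, to a one-parameter least-squares model and compares it with actual leave-one-out retraining. The model, the toy data, and all function names are assumptions made for illustration; they are not the LR models used in our experiments.

```python
def fit(xs, ys):
    """Least-squares slope of the one-parameter model y ~ theta * x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def loss(theta, x, y):
    """Squared-error loss of a single point."""
    return 0.5 * (theta * x - y) ** 2


def grad(theta, x, y):
    """Derivative of the single-point loss with respect to theta."""
    return (theta * x - y) * x


def influence_up_loss(xs, ys, j, x_test, y_test):
    """-grad(test) * H^-1 * grad(train_j), where H is the (scalar) Hessian
    of the mean training loss."""
    theta = fit(xs, ys)
    hessian = sum(x * x for x in xs) / len(xs)
    return -grad(theta, x_test, y_test) * grad(theta, xs[j], ys[j]) / hessian


def predicted_removal_change(xs, ys, j, x_test, y_test):
    """Removing point j corresponds to upweighting it by -1/n."""
    return -influence_up_loss(xs, ys, j, x_test, y_test) / len(xs)


def actual_removal_change(xs, ys, j, x_test, y_test):
    """Retrain without point j and measure the true change in test loss."""
    theta_full = fit(xs, ys)
    theta_loo = fit(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])
    return loss(theta_loo, x_test, y_test) - loss(theta_full, x_test, y_test)


# Hypothetical toy data: y = 2x plus a small alternating perturbation.
xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x + 0.1 * (-1) ** i for i, x in enumerate(xs, start=1)]
print(predicted_removal_change(xs, ys, 4, 1.0, 2.5))
print(actual_removal_change(xs, ys, 4, 1.0, 2.5))
```

The approximation is closest for training points with low leverage and degrades for points that dominate the fit, which is consistent with the observation that summed individual influences tend to underestimate the effect of removing a whole subgroup.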


Appendix B Data Processing

b.1. MIMIC-III

b.1.1. Cohort

Within the MIMIC-III dataset (Johnson et al., 2016), each individual patient may be admitted to the hospital on multiple different occasions and may be transferred to the ICU multiple times during their stay. We choose to focus on a patient’s first visit to the ICU, which is the most common case. Thus, we extract a cohort of patient EHR that corresponds to first ICU visits. We also only focus on ICU stays that lasted at least 36 hours and all patients older than 15 years of age. Using the MIMIC-Extract (Wang et al., 2019) processing pipeline, this results in a cohort of 21,877 unique ICU stays. The breakdown of this cohort by year, ethnicity and class labels for each task can be found in Table 6.

b.1.2. Features: Demographics and Hourly Labs and Vitals

For each patient’s stay, we collect 7 static demographic features and 181 lab and vital measurements which vary over time. The 7 demographic features comprise gender and race attributes which we observe for all patients. Meanwhile, the 181 lab results and vital signs have a high rate of missingness (90.6%) because tests are only ordered based on the medical needs of each patient. These tests also occur infrequently over time. This results in our dataset having irregularly sampled time series with high rates of missingness.

b.1.3. Transformation to 24-hour time-series

All time-varying measurements are aggregated into regularly-spaced hourly buckets (0-1 hr, 1-2 hr, etc.). Each recorded hourly value is the mean of any measurements captured in that hour. Each numerical feature is normalized to have zero mean and unit variance. The input to each prediction model is made up of two parts: the 7 demographic features and an hourly multivariate time-series of labs and vitals. The time series are censored to a fixed-duration of the first 24 hours to represent the first day of a patient’s stay in the ICU. This means all of our prediction tasks are performed based on the first day of a patient’s stay in the ICU.
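
The aggregation and normalization steps can be sketched as follows. This is a simplified illustration for a single univariate measurement rather than the MIMIC-Extract pipeline itself; the function names, the `(hours_since_admission, value)` input format, and the use of `None` for empty buckets are assumptions.

```python
from statistics import mean, pstdev


def hourly_means(measurements, num_hours=24):
    """Average raw (hours_since_admission, value) pairs into hourly buckets.

    Buckets with no measurement are returned as None; measurements outside
    the first num_hours are dropped (censoring to a fixed duration).
    """
    buckets = [[] for _ in range(num_hours)]
    for time, value in measurements:
        if 0 <= time < num_hours:
            buckets[int(time)].append(value)
    return [mean(bucket) if bucket else None for bucket in buckets]


def standardize(values):
    """Scale the observed values to zero mean and unit variance."""
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    return [None if v is None else (v - mu) / sigma for v in values]


# Hypothetical heart-rate readings; the last one falls outside the window.
readings = [(0.2, 80.0), (0.7, 90.0), (2.5, 70.0), (30.0, 60.0)]
print(hourly_means(readings, num_hours=4))
```

In practice the normalization statistics would be computed on the training set only, which this per-series sketch does not show.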

b.1.4. Imputation of missing values

We impute our data to deal with the high rate of missingness using a strategy called “simple imputation” developed by (Che et al., 2018) for MIMIC time-series prediction tasks. Each separate univariate measurement is forward filled, concatenated with a binary indicator of whether the value was measured within that hour, and concatenated with the time since the last measurement of this value.
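
A minimal sketch of this strategy for one univariate hourly series is shown below. It assumes that missing hours are represented by `None`, that hours before the first measurement are filled with a default of 0.0 (the feature mean after normalization), and that the time since the last measurement counts up from the start of the stay; these choices and the name `simple_impute` are assumptions made for illustration.

```python
def simple_impute(series, initial_value=0.0):
    """Return (value, measured, hours_since_last) for each hour.

    value            forward-filled measurement
    measured         1 if the value was observed in that hour, else 0
    hours_since_last hours elapsed since the last observed value
    """
    features = []
    last_value = initial_value
    hours_since_last = 0
    for value in series:
        if value is None:
            measured = 0
            hours_since_last += 1
        else:
            measured = 1
            last_value = value
            hours_since_last = 0
        features.append((last_value, measured, hours_since_last))
    return features


# Hypothetical five-hour series with two observed values.
print(simple_impute([None, 1.5, None, None, 2.0]))
```

Each input value therefore expands into three model inputs per hour.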

b.1.5. Clinical Aggregations Representation

Data representation is known to be important in building robust machine learning models. Unfortunately, medical data lacks standard representations comparable to, for example, Gabor filters in computer vision. Nestor et al. (2019) explore four different representations for medical prediction tasks and demonstrate that their medical aggregations representation is the most robust to temporal dataset shift. Thus, we use the medical aggregations representation to train our models. This representation groups together values that measure the same physiological quantity but are stored under different ItemIDs in the different EHR systems. This reduces the original 181 time-varying values to 68 values and reduces the rate of missingness to 78.25% before imputation.

b.2. NIH Chest X-Ray

We resize all images to 256x256 and normalize via the mean and standard deviation of the ImageNet dataset. We apply a center crop and a random horizontal flip, and we further apply random rotations of 10 degrees as data augmentation. We use early stopping on the validation set to select the optimal model.

Appendix C Data Statistics

c.1. MIMIC-III

Table 6 summarizes the class imbalance and ethnicity frequencies in the binary prediction health care tasks. It shows the imbalance between ethnicities across all tasks and the class imbalance in the mortality task.

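| Year | Ethnicity | Total | Mortality Negative | Mortality Positive | Length of Stay Negative | Length of Stay Positive |
|---|---|---|---|---|---|---|
| 2001 | Asian | 5 | 80% | 20% | 60% | 40% |
| 2001 | Black | 48 | 90% | 10% | 52% | 48% |
| 2001 | Hispanic | 7 | 86% | 14% | 57% | 43% |
| 2001 | Other | 217 | 86% | 14% | 53% | 47% |
| 2001 | White | 319 | 92% | 8% | 54% | 46% |
| 2002 | Asian | 23 | 100% | 0% | 48% | 52% |
| 2002 | Black | 102 | 92% | 8% | 56% | 44% |
| 2002 | Hispanic | 25 | 96% | 4% | 52% | 48% |
| 2002 | Other | 520 | 88% | 12% | 49% | 51% |
| 2002 | White | 937 | 93% | 7% | 51% | 49% |
| 2003 | Asian | 34 | 94% | 6% | 38% | 62% |
| 2003 | Black | 116 | 97% | 3% | 58% | 24% |
| 2003 | Hispanic | 45 | 96% | 4% | 53% | 47% |
| 2003 | Other | 465 | 90% | 10% | 46% | 54% |
| 2003 | White | 1203 | 94% | 6% | 50% | 50% |
| 2004 | Asian | 31 | 94% | 6% | 68% | 32% |
| 2004 | Black | 134 | 96% | 4% | 51% | 49% |
| 2004 | Hispanic | 38 | 89% | 11% | 47% | 53% |
| 2004 | Other | 353 | 90% | 10% | 45% | 55% |
| 2004 | White | 1236 | 93% | 7% | 51% | 49% |
| 2005 | Asian | 50 | 90% | 10% | 46% | 54% |
| 2005 | Black | 142 | 96% | 4% | 56% | 44% |
| 2005 | Hispanic | 48 | 96% | 4% | 56% | 44% |
| 2005 | Other | 279 | 89% | 11% | 48% | 52% |
| 2005 | White | 1323 | 91% | 9% | 51% | 49% |
| 2006 | Asian | 60 | 92% | 8% | 47% | 53% |
| 2006 | Black | 160 | 96% | 4% | 59% | 41% |
| 2006 | Hispanic | 62 | 97% | 3% | 45% | 55% |
| 2006 | Other | 215 | 89% | 11% | 49% | 51% |
| 2006 | White | 1434 | 93% | 7% | 54% | 46% |
| 2007 | Asian | 58 | 93% | 7% | 59% | 41% |
| 2007 | Black | 170 | 95% | 5% | 55% | 45% |
| 2007 | Hispanic | 73 | 99% | 1% | 59% | 41% |
| 2007 | Other | 235 | 88% | 12% | 52% | 48% |
| 2007 | White | 1645 | 93% | 7% | 54% | 46% |
| 2008 | Asian | 69 | 93% | 8% | 51% | 49% |
| 2008 | Black | 162 | 98% | 2% | 56% | 44% |
| 2008 | Hispanic | 90 | 93% | 7% | 46% | 54% |
| 2008 | Other | 136 | 91% | 9% | 56% | 44% |
| 2008 | White | 1691 | 93% | 7% | 54% | 46% |
| 2009 | Asian | 53 | 94% | 6% | 66% | 34% |
| 2009 | Black | 150 | 94% | 6% | 59% | 41% |
| 2009 | Hispanic | 70 | 93% | 7% | 59% | 41% |
| 2009 | Other | 180 | 87% | 13% | 57% | 43% |
| 2009 | White | 1612 | 93% | 7% | 55% | 45% |
| 2010 | Asian | 55 | 87% | 13% | 47% | 53% |
| 2010 | Black | 177 | 97% | 3% | 65% | 35% |
| 2010 | Hispanic | 71 | 96% | 4% | 54% | 46% |
| 2010 | Other | 303 | 87% | 13% | 52% | 48% |
| 2010 | White | 1568 | 94% | 6% | 52% | 48% |
| 2011 | Asian | 63 | 92% | 8% | 63% | 37% |
| 2011 | Black | 191 | 93% | 7% | 55% | 45% |
| 2011 | Hispanic | 89 | 96% | 4% | 53% | 47% |
| 2011 | Other | 268 | 85% | 15% | 45% | 55% |
| 2011 | White | 1622 | 95% | 5% | 54% | 46% |
| 2012 | Asian | 42 | 98% | 2% | 52% | 48% |
| 2012 | Black | 127 | 95% | 5% | 53% | 47% |
| 2012 | Hispanic | 55 | 93% | 7% | 71% | 29% |
| 2012 | Other | 276 | 91% | 9% | 54% | 46% |
| 2012 | White | 945 | 92% | 8% | 58% | 42% |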

Table 6. This is a breakdown of patients in our cohort by ethnicity for each year for both tasks.

c.2. NIH Chest X-Ray

# of Images # of Patients View Male Female
112,120 30,805 Front 56.49% 43.51%
Table 7. Sex breakdown of images in NIH Chest X-Ray dataset
Disease Label Negative Class Percentage Positive Class Percentage
Atelectasis 94.48% 5.52%
Cardiomegaly 97.50% 2.50%
Consolidation 98.60% 1.40%
Edema 99.74% 0.26%
Effusion 85.84% 4.16%
Emphysema 99.14% 0.86%
Fibrosis 98.15% 1.85%
Hernia 99.73% 0.27%
Infiltration 88.30% 11.70%
Mass 95.84% 4.16%
Nodule 94.61% 5.39%
Pleural Thickening 97.52% 2.48%
Pneumonia 99.45% 0.55%
Pneumothorax 99.22% 0.88%
Table 8. Disease label breakdown of images in NIH Chest X-Ray dataset

Appendix D MIMIC-III: Dataset Shift Quantification

d.1. Domain Classifier

Using the domain classifier method presented in the main paper, we evaluate both the significance and the malignancy of the dataset shift. For LR, the shift between EHR systems is most malignant in the mortality task, while the shifts in LOS and intervention prediction are not highly malignant (Table 9). For GRUD, the shift between the EHR systems is no longer malignant in any of the binary tasks, while for CNNs the shift is relatively more malignant in the intervention prediction task (Table 10). The domain classifier performed significantly better than random chance if the p-value in parentheses is less than 0.05.

[Table 9 layout: rows are the years 2002 to 2012; columns are the tasks Mortality, LOS, and Intervention Prediction (Vaso).]

Table 9. Shift malignancy with statistical significance of domain classifier performance in parentheses. Lower values represent higher malignancy. LR was used for both the domain classifier and determining the accuracy on the top 100 most anomalous samples. The shift between EHRs is most malignant in the mortality task.

[Table 10 layout: rows are the years 2002 to 2012; columns are the tasks Mortality, LOS, and Intervention Prediction (Vaso).]

Table 10. Shift malignancy with statistical significance of domain classifier performance in parentheses. GRUD was used for both the domain classifier and determining the accuracy on the top 100 most anomalous samples for mortality and LOS. CNN was used for both the domain classifier and determining the accuracy on the top 100 most anomalous samples for intervention prediction (Vaso). Malignant dataset shift appears in the earlier years in mortality and LOS. None of the shifts are malignant in intervention prediction.

Appendix E Algorithm Definitions

e.1. DP-SGD Algorithm

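  Input: Examples $\{x_1, \dots, x_N\}$, loss function $\mathcal{L}(\theta) = \frac{1}{N} \sum_i \mathcal{L}(\theta, x_i)$. Parameters: learning rate $\eta_t$, noise multiplier $\sigma$, mini-batch size $L$, norm bound $C$.
  Initialize $\theta_0$ randomly
  for $t \in [T]$ do
     Take a random mini-batch $L_t$ with sampling probability $L/N$
     Compute gradient
     For each $i \in L_t$, compute $g_t(x_i) \leftarrow \nabla_{\theta_t} \mathcal{L}(\theta_t, x_i)$
     Clip gradient: $\bar{g}_t(x_i) \leftarrow g_t(x_i) / \max\left(1, \frac{\|g_t(x_i)\|_2}{C}\right)$
     Add noise: $\tilde{g}_t \leftarrow \frac{1}{L} \left( \sum_i \bar{g}_t(x_i) + \mathcal{N}(0, \sigma^2 C^2 I) \right)$
     Descent: $\theta_{t+1} \leftarrow \theta_t - \eta_t \tilde{g}_t$
  end for
  Output: $\theta_T$ and the overall privacy cost $(\epsilon, \delta)$.
Algorithm 1 Differentially private SGD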
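
A single iteration of the loop above can be sketched in plain Python as follows. The sketch assumes the per-example clipping and Gaussian noise of DP-SGD as described by Abadi et al. (2016); it is not the TensorFlow Privacy implementation used in our experiments, it omits privacy accounting, and the names `clip` and `dp_sgd_step` are hypothetical.

```python
import math
import random


def clip(gradient, norm_bound):
    """Scale a per-example gradient so its L2 norm is at most norm_bound."""
    norm = math.sqrt(sum(g * g for g in gradient))
    scale = max(1.0, norm / norm_bound)
    return [g / scale for g in gradient]


def dp_sgd_step(theta, per_example_grads, lr, noise_multiplier, norm_bound, rng):
    """One DP-SGD update: clip each gradient, sum, add noise, average, descend."""
    clipped = [clip(g, norm_bound) for g in per_example_grads]
    batch_size = len(clipped)
    noisy_mean = []
    for j in range(len(theta)):
        total = sum(g[j] for g in clipped)
        # Gaussian noise with standard deviation sigma * C per coordinate.
        noise = rng.gauss(0.0, noise_multiplier * norm_bound)
        noisy_mean.append((total + noise) / batch_size)
    return [t - lr * g for t, g in zip(theta, noisy_mean)]


# Hypothetical two-parameter model with a mini-batch of two gradients.
rng = random.Random(0)
theta = dp_sgd_step([1.0, 1.0], [[3.0, 4.0], [0.0, 0.0]], 0.5, 1.1, 1.0, rng)
print(theta)
```

Because clipping rescales every gradient whose norm exceeds the bound, examples with unusually large gradients contribute no more than typical examples to each update.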

e.2. Objective Perturbation

  Inputs: Data , parameters
  Output: Approximate minimizer
  The weights of the linear model are defined as f
  The private loss function is defined as
  Normalize all the records in by
  Let - +
  If , then , else , and
  Draw a vector b according to (b) = with
  Compute = argmin
Algorithm 2 Objective perturbation for differentially private LR

Appendix F Training Setup and Details

We trained all of our models on a single NVIDIA P100 GPU with 32 CPUs and 32GB of RAM. We run each model with five random seeds so that we can report means and variances. Furthermore, each model is trained in three privacy settings: none, low, and high. Each setting required a different set of hyperparameters, which are discussed in the sections below. For all private models we fix the batch size to 64 and the number of microbatches to 16, resulting in four examples per microbatch.

f.1. Logistic Regression

LR models are linear classification models of low capacity and moderate interpretability. Because LR does not naturally handle temporal data, 24 one-hour buckets of patient history are concatenated into one vector along with the static demographic vector. For training with no privacy, we use scikit-learn’s LogisticRegression class (Pedregosa et al., 2011). We perform a random search over the following hyperparameters: regularisation strength (C), regularisation type (L1 or L2), solver (“liblinear” or “saga”), and maximum number of iterations. For private training, we implement the model using TensorFlow (Abadi et al., 2016) and TensorFlow Privacy. We use L2 regularisation and perform a grid search over the number of epochs (5, 10, 15, 20) and the learning rate (0.001, 0.002, 0.005, 0.01). Finally, we use the DP-SGD optimizer implemented in TensorFlow Privacy to train our models.

f.2. GRU-D

GRU-D models are a recent variant of recurrent neural networks (RNNs) designed specifically to model irregularly sampled time series (equivalently, time series with missingness) by inducing learned exponential regressions to the mean for unobserved values. We implemented the model in TensorFlow based on a publicly available PyTorch implementation (Han-JD, 2019). For both non-private and private training, we use a hidden layer size of 67 units, batch normalisation, and dropout with a probability of 0.5 on the classification layer, as in the original work. We use the Adam optimizer for non-private training and the DPAdam optimizer for private training, with early stopping criteria for both.

f.3. CNN

We use three-layer 1D CNN models with max pooling layers and ReLU activation functions. For all models we perform a grid search over the following hyperparameters: dropout (0.1, 0.2, 0.3, 0.4, 0.5), number of epochs (12, 15, 20), and learning rate (0.001, 0.002, 0.005, 0.01). Finally, we use the DPAdam optimizer for private learning.

f.4. DenseNet-121

We fine-tune a DenseNet-121 model that was pretrained on ImageNet, using the Adam optimizer without DP and the DPAdam optimizer for private learning. We use the hyperparameters reported by Seyyed-Kalantari et al. (2020).

Appendix G Objective Perturbation Results

g.1. Utility Tradeoff

[Table 11 layout: AUROC and AUPRC of LR on the MIMIC-III tasks Mortality and Length of Stay > 3; columns are the None, Low, and High privacy settings.]

Table 11. Privacy-utility tradeoff across vision and health care tasks. The health care tasks have a more significant tradeoff between the High and Low or None setting. The tradeoff is better in more balanced tasks (length of stay and intervention onset), and worst in tasks such as mortality.

g.2. Robustness Tradeoff

Figure 4. Characterizing the effect of DP learning on robustness to non-stationarity and concept shift. One instance of increased robustness in the 2009 column for mortality prediction in the high privacy setting (A), but this does not hold across all tasks and models. Performance drops in the 2009 column for LOS in both LR and GRUD (B), and a much worse drop in the high privacy CNN for intervention prediction (C).
Figure 5. Characterizing the effect of DP learning on robustness to non-stationarity and concept shift. One instance of increased robustness in the 2009 column for mortality prediction in the high privacy setting (A), but this does not hold across all tasks and models. Performance drops in the 2009 column for LOS in both LR and GRUD (B), and a much worse drop in the high privacy CNN for intervention prediction (C).

Appendix H Additional DP-SGD Results

h.1. Health Care AUPRC Analysis

In health care settings, the ability of the classifier to predict the positive class is important. We characterize the effect of differential privacy on this further by measuring the average performance across the years (Table 12) and the robustness across the years (Fig. 6) with area under the precision recall curve (AUPRC) and AUPRC (Micro) for the intervention prediction task.

[Table 12 layout: rows are Mortality (LR, GRUD), Length of Stay > 3 (LR, GRUD), and Intervention Onset (Vaso) (LR, CNN); columns are the None, Low, and High privacy settings.]

Table 12. Privacy-utility tradeoff for AUPRC in health care tasks. The health care tasks have a more significant tradeoff between the High and Low or None setting. The tradeoff is better in more balanced tasks (length of stay and intervention onset), and worst in tasks such as mortality where class imbalance is present. There is a 23% and 24% drop in the AUPRC (Micro) between no privacy and high privacy settings for mortality prediction for LR and GRUD respectively.
Figure 6. Characterizing the effect of DP learning on AUPRC robustness to non-stationarity and concept shift. One instance of increased robustness in the 2009 column for mortality prediction in the high privacy setting (A), but this does not hold across all tasks and models. Performance drops in the 2009 column for LOS in both LR and GRUD (B), and a much worse drop in the high privacy CNN for intervention prediction (C).

h.2. MIMIC-III Robustness Correlation Analysis

To test whether DP has a significant impact on robustness, we perform a Pearson’s correlation test between the generalization gap and the malignancy of the shift. The generalization gap is measured as the difference between the classifier’s performance on an in-distribution test set and an out-of-distribution test set. The results of this correlation analysis show that differential privacy has no conclusive effect on robustness to dataset shift (Table 13).
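
The quantities involved can be sketched as follows. The snippet computes a per-year generalization gap and the Pearson correlation coefficient only; it does not compute the p-value of the correlation test, and the function names and example numbers are hypothetical.

```python
import math


def generalization_gaps(in_dist_scores, out_dist_scores):
    """Per-year gap between in-distribution and out-of-distribution AUROC."""
    return [a - b for a, b in zip(in_dist_scores, out_dist_scores)]


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-year values, for illustration only.
gaps = generalization_gaps([0.85, 0.84, 0.86], [0.80, 0.75, 0.84])
shift_malignancy = [0.70, 0.55, 0.90]
print(pearson_r(gaps, shift_malignancy))
```

The sign of the coefficient depends on how malignancy is encoded, so any interpretation of the output should follow the convention stated alongside the reported correlations.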

[Table 13 layout: Pearson’s correlation for Mortality (LR, GRUD), Length of Stay > 3 (LR, GRUD), and Intervention Onset (Vaso); columns are the None, Low, and High privacy settings.]

Table 13. We calculate the Pearson’s correlation between the AUROC gap and the malignancy of the shift. Positive correlations mean a lack of robustness since the generalization gap increases as the shift becomes more significant. Negative correlations mean improved robustness since the generalization gap decreases as the shift becomes more significant. We notice that differential privacy improves robustness when a malignant shift is present in mortality but results in worse robustness in length of stay. None of the correlations are statistically significant, so a claim cannot be made that differential privacy improves robustness to dataset shift in health care.

h.3. Fairness Analysis

h.3.1. Averages

[Table 14 layout: AUROC Gap, Recall Gap, Parity Gap, and Specificity Gap by task, protected attribute, and model; columns are the None, Low, and High privacy settings. The AUROC Gap rows are Mortality (Ethnicity; LR, GRUD), Length of Stay > 3 (Ethnicity; LR, GRUD), Intervention Onset (Vaso) (Ethnicity; LR, CNN), and NIH Chest X-Ray (Sex; DenseNet-121). The Recall, Parity, and Specificity Gap rows are the same except that Intervention Onset (Vaso) is not included.]

Table 14. The fairness gaps between white and Black patients across the different health care tasks, privacy levels and models. Positive values represent a bias towards the white patients and negative values represent a bias towards the Black patients. The models are more fair as the metric moves towards zero. The models are more unfair as the metric moves away from zero.

h.3.2. Time Series

Although there is no effect on fairness from differential privacy when we average the metrics across all the years, we investigate how the metrics vary across each year we tested on. There is greater variance in the non-private models than in the private models for the parity, recall, and specificity gaps (Figs. 8 and 9). In intervention prediction, there is one spike in unfairness in 2007 which is seen across all levels of privacy (Fig. 7).

Figure 7. Characterizing the effect of differentially private learning on fairness with respect to non-stationarity and concept shift in the intervention prediction task. We find that all models experience similar performance with respect to the AUROC gap.
Figure 8. Characterizing the effect of differentially private learning on fairness with respect to non-stationarity and concept shift in the mortality prediction task. We find that the models trained without privacy experience high variance across the years across all definitions while models trained with privacy exhibit greater stability.
Figure 9. Characterizing the effect of differentially private learning on fairness with respect to non-stationarity and concept shift in the length of stay prediction task. We find that the models trained without privacy experience high variance across the years across all definitions while models trained with privacy exhibit greater stability.

Appendix I Influence Functions

i.1. Fairness Graph

Figure 10. Group influence of training data per ethnic groups on 100 test patients with highest influence variance. The group influence of our majority ethnicity (white patients) is enhanced significantly in the high privacy setting, as demonstrated by the increased amplitude of those points in (B) and (D). In the no privacy setting the group influence of each ethnicity is similar for both white (A) and Black patients (C).

i.2. Privacy Makes the Most Harmful and Helpful Training Patients Personal

We demonstrate that, in the no privacy setting, the test patients with the highest influence variance tend to share the same most helpful and most harmful training patients (Tables 15 and 17). This means that some training patients carry their large influence across the test set of patients. This has important implications for differential privacy, since it bounds the influence that any one patient can have on the test loss of any other patient. The effect of this bounding is that the most helpful and harmful training patients become more personal to each test patient in the high privacy setting (Tables 16 and 18).
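
The frequency counts reported in the tables below can be obtained with a simple tally. The sketch assumes an influence matrix in which `influence[t][i]` is the influence of training patient `i` on the loss of test patient `t`, and that the most helpful training patient is the one with the most negative value (the largest reduction in test loss) while the most harmful is the one with the most positive value. This sign convention and all names are assumptions made for illustration.

```python
from collections import Counter


def most_helpful_counts(influence):
    """Count how often each training patient is the most helpful one."""
    return Counter(
        min(range(len(row)), key=row.__getitem__) for row in influence
    )


def most_harmful_counts(influence):
    """Count how often each training patient is the most harmful one."""
    return Counter(
        max(range(len(row)), key=row.__getitem__) for row in influence
    )


def top_share(counts, num_test_patients):
    """Fraction of test patients that share the most frequent training patient."""
    return max(counts.values()) / num_test_patients


# Hypothetical influence values for three test and three training patients.
influence = [
    [-0.5, 0.1, 0.2],
    [-0.3, 0.4, -0.1],
    [0.2, -0.9, 0.0],
]
helpful = most_helpful_counts(influence)
print(helpful, top_share(helpful, len(influence)))
```

A smaller top share indicates that the most influential training patient is more specific to each test patient.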

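Most Helpful Patients (LR Mortality)

| Subject ID | Most Helpful Influence Count |
|---|---|
| 9980 | 24 |
| 9924 | 13 |
| 98995 | 8 |
| 9954 | 5 |
| 9905 | 5 |
| 985 | 5 |
| 990 | 4 |
| 9929 | 4 |
| 9998 | 4 |
| 99726 | 3 |
| 9942 | 3 |
| 9896 | 2 |
| 9873 | 2 |
| 9867 | 2 |
| 9825 | 2 |
| 9937 | 2 |
| 99938 | 2 |
| 9893 | 2 |
| 9932 | 1 |
| 992 | 1 |
| 99817 | 1 |
| 98899 | 1 |
| 9885 | 1 |
| 98009 | 1 |
| 977 | 1 |
| 99485 | 1 |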

Table 15. The frequency of the most helpful training patient in the first 100 patients with the highest influence variance for no privacy. Almost 25% of the top 100 share the same most helpful training patient.

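Most Helpful Patients (LR Mortality)

| Subject ID | Most Helpful Influence Count |
|---|---|
| 99938 | 7 |
| 9994 | 5 |
| 9977 | 4 |
| 9970 | 4 |
| 9998 | 3 |
| 9991 | 3 |
| 9987 | 3 |
| 9973 | 3 |
| 99528 | 3 |
| 9965 | 3 |
| 9949 | 2 |
| 99598 | 2 |
| 99469 | 2 |
| 9889 | 2 |
| 9980 | 2 |
| 99817 | 2 |
| 99384 | 2 |
| 9984 | 2 |
| 9937 | 2 |
| 99883 | 2 |
| 99726 | 2 |
| 9950 | 2 |
| 99936 | 2 |
| 9983 | 2 |
| 992 | 2 |
| 9974 | 2 |
| 9988 | 2 |
| 9954 | 2 |
| 9924 | 1 |
| 9968 | 1 |
| 9885 | 1 |
| 9882 | 1 |
| 9752 | 1 |
| 9886 | 1 |
| 99038 | 1 |
| 99 | 1 |
| 99063 | 1 |
| 9951 | 1 |
| 98698 | 1 |
| 98919 | 1 |
| 998 | 1 |
| 9967 | 1 |
| 9833 | 1 |
| 9867 | 1 |
| 9818 | 1 |
| 9929 | 1 |
| 99691 | 1 |
| 9813 | 1 |
| 9834 | 1 |
| 9784 | 1 |
| 9915 | 1 |
| 9963 | 1 |
| 9942 | 1 |
| 9960 | 1 |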

Table 16. The frequency of the most helpful training patient in the first 100 patients with the highest influence variance for high privacy. At most 7% of the top 100 share the same most helpful training patient which is much less than the no privacy setting. Increasing privacy results in the most helpful patient being more personal to the test patient.

Most Harmful Patients (LR Mortality)

| Subject ID | Most Harmful Influence Count |
|---|---|
| 10013 | 22 |
| 10015 | 19 |
| 10007 | 13 |
| 10045 | 10 |
| 10036 | 8 |
| 10076 | 5 |
| 10088 | 4 |
| 10077 | 4 |
| 1004 | 3 |
| 10102 | 3 |
| 10028 | 2 |
| 10038 | 2 |
| 10173 | 2 |
| 10027 | 1 |
| 10184 | 1 |
| 10289 | 1 |

Table 17. The frequency of the most harmful training patient in the first 100 patients with the highest influence variance for no privacy. 22% of the top 100 share the same most harmful training patient.

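Most Harmful Patients (LR Mortality)

| Subject ID | Most Harmful Influence Count |
|---|---|
| 100 | 7 |
| 10063 | 7 |
| 10022 | 6 |
| 10007 | 5 |
| 1005 | 4 |
| 10013 | 4 |
| 10026 | 4 |
| 10006 | 4 |
| 10010 | 4 |
| 10032 | 3 |
| 10076 | 3 |
| 10038 | 3 |
| 10050 | 3 |
| 10088 | 2 |
| 10049 | 2 |
| 10059 | 2 |
| 10028 | 2 |
| 10045 | 2 |
| 10046 | 2 |
| 10094 | 2 |
| 1009 | 2 |
| 10040 | 2 |
| 10089 | 2 |
| 10030 | 2 |
| 10112 | 2 |
| 10233 | 1 |
| 10085 | 1 |
| 101 | 1 |
| 10114 | 1 |
| 10149 | 1 |
| 1028 | 1 |
| 10320 | 1 |
| 10110 | 1 |
| 1020 | 1 |
| 10335 | 1 |
| 10043 | 1 |
| 10044 | 1 |
| 10266 | 1 |
| 10042 | 1 |
| 10015 | 1 |
| 10122 | 1 |
| 10217 | 1 |
| 10165 | 1 |
| 10123 | 1 |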

Table 18. The frequency of the most harmful training patient in the first 100 patients with the highest influence variance for high privacy. At most 7% of the top 100 share the same most harmful training patient which is much less than the no privacy setting. Increasing privacy results in the most harmful patient being more personal to the test patient.

i.3. Years Analysis

We extend our analysis from the main paper by looking at the change in influence across each year for both no privacy and high privacy in the mortality task using LR.

Figure 11. The progression of the influences of all training points on the test points across each year. The high privacy continues to bound the influence for each year agnostic to the dataset shift.