Privacy-Preserving and Communication-Efficient Causal Inference for Hospital Quality Measurement

by Larry Han, et al.
Harvard University

Data sharing can improve hospital quality measurement, but sharing patient-level data between hospitals is often infeasible due to privacy concerns. Motivated by the problem of evaluating the quality of care provided by candidate Cardiac Centers of Excellence (CCE), we propose a federated causal inference framework to safely leverage information from peer hospitals to improve the precision of quality estimates for a target hospital. We develop a federated doubly robust estimator that is privacy-preserving (requiring only summary statistics be shared between hospitals) and communication-efficient (requiring only one round of communication between hospitals). We contribute to the quality measurement and causal inference literatures by developing a framework for assessing treatment-specific performance in hospitals without sharing patient-level data. We also propose a penalized regression approach on summary statistics of the influence functions for efficient estimation and valid inference. In so doing, the proposed estimator is data-adaptive, downweighting hospitals with different case-mixes from the target hospital for bias reduction and upweighting hospitals with similar case-mixes for efficiency gain. We show the improved performance of the federated global estimator in extensive simulation studies. Studying candidate CCE, we find that the federated global estimator improves precision of treatment effect estimates by 34% to 86% for target hospitals, qualitatively altering the evaluation of the percutaneous coronary intervention (PCI) treatment effect in 22 of 51 hospitals. Focusing on treatment-specific rankings, we find that hospitals rarely excel in both PCI and medical management (MM), stressing the importance of treatment-specific performance assessments.



1 Introduction

1.1 Measuring the Quality of Cardiac Centers of Excellence

Acute myocardial infarction (AMI), commonly known as heart attack, is one of the ten leading causes of hospitalization and death in the United States (hcup2017; benjamin2017). Consequently, hospital quality measurement in AMI has been closely studied, with hospital risk-adjusted mortality rates reported by the Centers for Medicare and Medicaid Services (CMS) since 2007 (krumholz2006). Numerous accreditation organizations release reports on AMI hospital performance and designate high-performing hospitals as Cardiac Centers of Excellence (CCE) (centers). Often, these reports describe a hospital’s overall performance for all patients admitted for AMI, or for a single type of treatment only. However, AMI patients receive different types of treatments depending on the type of AMI they have, their age or comorbid conditions, and the admitting hospital’s technical capabilities and institutional norms.

For example, a cornerstone treatment for AMI is percutaneous coronary intervention (PCI), which is a cardiology procedure that restores blood flow to sections of the heart affected by AMI. PCI has been called one of the ten defining advances of modern cardiology (braunwald2014). It is considered an important part of the AMI treatment arsenal, and many accreditation organizations require hospitals to perform a minimum annual volume to be considered for accreditation as a CCE (centers). However, because not every patient is indicated for PCI (acc2017), or a hospital’s PCI service capacity may be insufficient to meet clinical demand, a substantial share of patients may instead receive medical management (MM) without interventional cardiology. Quality of care on these specific regimens may vary within a hospital and between hospitals because these different forms of treatment may draw on different skills and capabilities.

1.2 Causal Inference for Quality Measurement

Quality measurement is of material importance to providers not only because of its rapidly growing role in reimbursement, but also as an essential tool for helping providers understand their own performance and standing among their peers. Quality measurement programs are also important to patients and have the potential to guide them toward high-quality hospitals. While most quality reports estimate hospitals’ overall performance, it is also important to understand outcomes by specific forms of treatment delivered within hospitals, such as PCI and MM, the latter of which is preferred when PCI is deemed risky or unnecessary. Because providers need to understand their strengths and weaknesses, it is necessary to develop a framework that encompasses hospitals’ performance both overall and on specific alternative treatments, enabling hospitals to make more informed decisions on capital and labor investments to optimize performance. This framework could be useful to patients as well, as it may help them choose providers that excel at delivering the type of treatment they prefer.

To develop and implement this framework in practice, an important challenge to overcome is data sharing. Although data sharing has been shown to improve quality measurement and patient safety (d2021clinical), it is often infeasible to share patient-level data across hospitals due to confidentiality concerns. Traditional methods for hospital quality measurement have relied on either fixed or random effects models (austin2003use; jones2011identification; normand1997statistical). In such approaches, investigators need direct access to patient-level data from other hospitals in order to assess their own performance. Even in the limited settings where direct access is possible, traditional approaches may sometimes underestimate or overestimate the true performance of a target hospital by using other hospitals that are dissimilar in key ways. In particular, such methods can shrink estimates toward the overall population mean, potentially leading to bias for affected target hospitals (george2017).

Ideally, a quality measurement framework would be able to use summary-level data when patient-level data cannot be shared across hospitals, identify relevant peer hospitals in a data-adaptive manner, and properly adjust for differences in patient case mixes. The causal inference framework is well-suited to the task of quality measurement (silber2014; keele2020hospital; longford2020performance), which fundamentally asks a causal question about the effect of medical care on patient outcomes. In the causal inference framework, these efforts have been carried out using direct standardization (varewyck2014shrinkage) or indirect standardization (daignault2017doubly). Recent advances in causal inference can leverage summary-level data from federated data sources, but these approaches are either limited in their ability to specify a target population of interest (vo2021federated; xiong2021federated) or unable to examine treatment-specific outcomes (han2021federated; xiong2021federated), both of which are crucial for hospital quality measurement. These obstacles underscore a substantial need for privacy-preserving, communication-efficient integrative estimation and inferential methods that account for heterogeneity both within local hospitals and across systems.

1.3 Contribution and Outline

We develop a federated causal inference framework and methods to leverage data from multiple hospitals in a fashion that is privacy-preserving (requiring hospitals to share only summary-level statistics rather than patient-level data) and communication-efficient (requiring only one round of communication between hospitals). This is practically important in multi-hospital settings where multiple rounds of communication, such as iterative optimization strategies, would be time-intensive, cost-prohibitive, and could introduce human error (li2020federated). We develop a doubly robust federated global estimator to efficiently and safely aggregate estimates from multiple hospitals. To ensure that comparisons between hospitals are as fair as possible, we account for heterogeneity in the distribution of patient covariates between hospitals through a density ratio weighting approach that does not require patient-level data sharing.

We demonstrate that the federated global estimator is able to avoid negative transfer (pan2009survey) so that incorporating data from other hospitals does not diminish the performance of the estimator. Specifically, we avoid negative transfer by proposing and solving a penalized regression on summary statistics of the influence function. Statistically, this strategy has the advantages of a) producing a data-adaptive estimator, b) ensuring valid inference, and c) efficiently leveraging information from multiple sources. Practically, this strategy allows us to make hospital comparisons on treatment-specific outcomes and identify peer hospitals from hospital-level weights.

In this paper, we focus on the setting where the target population of interest is the case-mix of a particular hospital. However, our method permits flexibility in the specification of the target population, encompassing either direct standardization (where each hospital’s patient sample is re-weighted to be representative of a specific population) or indirect standardization (where a particular hospital’s patient case-mix is the target population), as desired.

We focus on the problem of quality measurement across a set of candidate CCE across states (centers). Studying Medicare patients admitted to these 51 hospitals for AMI, we examine differences in outcomes between patients who received PCI and those who received MM. We also benchmark target hospitals relative to other hospitals (which we term “source” hospitals). Our approach can estimate contrasts within hospitals and between hospitals, whereas the typical quality measurement approach endeavors to only estimate contrasts on all patients between hospitals, or for interventional treatment only. In this sense, our approach can also be seen as a more general approach to quality measurement.

The paper proceeds as follows. In Section 2 we detail the federated Medicare dataset for measuring the quality of care provided by CCE. Section 3 describes the problem set-up, identifying assumptions, estimation procedures, and results for inference. In Section 4, we demonstrate the performance of the federated global estimators in extensive simulation studies. In Section 5, we evaluate the quality of overall AMI care, PCI care, and MM care provided by candidate CCE. Section 6 concludes and discusses possibilities for future extensions of our work.

2 Measuring Quality Rendered by CCE

2.1 Building Medicare Records

Our federated dataset consists of a sample of fee-for-service Medicare beneficiaries who were admitted to short-term acute-care hospitals for AMI from January 1, 2014 through November 30, 2017. A strength of this dataset is that it is representative of the entire fee-for-service Medicare population in the United States. Importantly, the Medicare claims include complete administrative records from inpatient, outpatient, or physician providers. For each patient, we used all of their claims from the year leading up to their hospitalization to characterize their degree of severity upon admission for AMI. ICD-9 and ICD-10 diagnosis and procedure codes in the inpatient records permit easy identification of AMI admissions and PCI treatment status, and mortality status is validated for the purpose of accurate measurement of our outcome.

To ensure data consistency, we excluded patients under age 66 at admission or who lacked 12 months of complete fee-for-service coverage in the year prior to admission. We also excluded admissions with missing or invalid demographic or date information, and patients who were transferred from hospice or another hospital. After exclusions, we randomly sampled one admission per patient.

2.2 Identifying CCE

We examine treatment rates and outcomes for AMI across hospitals that have a sufficient annual volume of PCI procedures to be eligible to be certified as CCE. To be eligible, insurers require that a hospital perform at least 100 PCIs per year (centers), which translates to a minimum of 80 PCI procedures in our four-year 20% sample of Medicare patients. Fifty-one hospitals met the minimum volume threshold for certification as a CCE. Collectively, these 51 candidate CCE treated 11,103 patients in the final federated dataset. These 51 hospitals were distributed across 29 states and included both urban and rural hospitals. The set of hospitals also displayed diverse structural characteristics as defined by data from the Medicare Provider of Services file (posfile), including academic medical centers, non-teaching hospitals, not-for-profit, for-profit, and government-administered hospitals, as well as hospitals with varying levels of available cardiac technology services (silber2018). Thus, although CCE share a common designation, they can be heterogeneous in terms of their characteristics, capacity, and capabilities.
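The conversion from the annual eligibility threshold to the threshold used in the sample can be checked with simple arithmetic. The sketch below assumes, as stated above, a four-year window and a 20% sample; the function name `min_sample_volume` is illustrative and does not come from the original analysis.

```python
def min_sample_volume(annual_minimum, years, sample_fraction):
    """Expected minimum number of procedures observed in a partial sample."""
    return annual_minimum * years * sample_fraction


# 100 PCIs per year, observed over four years in a 20% sample of patients.
threshold = min_sample_volume(annual_minimum=100, years=4, sample_fraction=0.20)
print(f"Minimum PCI volume in the sample: {threshold:.0f}")
```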

2.3 Patient Covariates, Treatment, and Outcome

Baseline covariates included patient age on admission, admission year, gender, AMI diagnosis subtype, history of PCI or CABG procedures, and history of dementia, heart failure, unstable angina, or renal failure, all four of which were ascertained using one year of look-back through each patient’s claims (krumholz2006; cmsreport2021). We used ICD-9 and ICD-10 procedure codes from the index admission claim to determine whether each patient received PCI treatment (ahrq_pci). Our outcome of interest was all-cause, all-location mortality within 30 days of admission.

3 Methods

3.1 Setting and Notation

For each patient $i$, we observe an outcome $Y_i$, which can be continuous or discrete, a $p$-dimensional baseline covariate vector $\mathbf{X}_i = (X_{i1}, \ldots, X_{ip})^\top$, and a binary treatment indicator $A_i$, with $A_i = 1$ denoting treatment and $A_i = 0$ for control. In our case study, we examine the outcome of 30-day mortality, treatment with PCI as opposed to the control of MM, and define a vector of 10 baseline covariates that are salient for both treatment assignment and mortality: age, sex, two high-risk AMI subtypes, history of PCI or bypass surgery, dementia, heart failure, unstable angina, and renal failure. Under the potential outcomes framework (neyman1923application; rubin1974estimating), we denote $Y_i = A_i Y_i^{(1)} + (1 - A_i) Y_i^{(0)}$, where $Y_i^{(a)}$ is the potential outcome for patient $i$ under treatment $A_i = a$, $a = 0, 1$.

Suppose data for a total of $N$ patients are stored at $K$ independent study hospitals, where the $k$-th hospital has sample size $n_k$ so $N = \sum_{k=1}^{K} n_k$. Let $R_i$ be a hospital indicator such that $R_i = k$ indicates that patient $i$ is in hospital $k$. We summarize the observed data at each hospital $k$ as $D_k = \{(Y_i, \mathbf{X}_i^\top, A_i)^\top : R_i = k\}$, where each hospital has access to its own patient-level data but cannot share these data with other hospitals. Let $T \subseteq \{1, \ldots, K\}$ indicate hospitals that comprise the target population and $S = \{1, \ldots, K\} \setminus T$ indicate hospitals that comprise the source population.

3.2 Estimand and Identifying Assumptions

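Our goal is to estimate the target average treatment effect (TATE),

$$\Delta_T = \mu_T^{(1)} - \mu_T^{(0)}, \qquad \mu_T^{(a)} = \mathrm{E}_T\{Y^{(a)} \mid R \in T\}, \tag{1}$$

where the expectation is taken over the target population covariate distribution.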

To identify the mean potential outcome under each treatment in the target population, we require the following assumptions:

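Assumption 1 (Consistency): If $A = a$, then $Y = Y^{(a)}$.

Assumption 2 (Positivity of treatment assignment): $P(A = a \mid \mathbf{X} = \mathbf{x}, R = k) > 0$ for all $a$ and for all $\mathbf{x}$ with positive density, i.e., $f(\mathbf{x} \mid R = k) > 0$.

Assumption 3 (Positivity of hospital selection): $P(R = k \mid \mathbf{X} = \mathbf{x}) > 0$ for all $\mathbf{x}$ with positive density in the population.

Assumption 4 (Mean exchangeability of treatment assignment): $\mathrm{E}\{Y^{(a)} \mid \mathbf{X} = \mathbf{x}, R = k, A = a\} = \mathrm{E}\{Y^{(a)} \mid \mathbf{X} = \mathbf{x}, R = k\}$.

Assumption 5 (Mean exchangeability of hospital selection): $\mathrm{E}\{Y^{(a)} \mid \mathbf{X} = \mathbf{x}, R = k\} = \mathrm{E}\{Y^{(a)} \mid \mathbf{X} = \mathbf{x}\}$.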

Assumption 1 states that the observed outcome for patient $i$ under treatment $a$ is the patient’s potential outcome under the same treatment. Assumption 2 is the standard treatment overlap assumption (rosenbaum1983central), which states that the probability of being assigned to each treatment, conditional on baseline covariates, is positive in each hospital. This assumption is plausible in our case study because every hospital performs PCI and also renders MM, and no baseline covariate is an absolute contraindication for PCI. Assumption 3 states that the probability of being observed in a hospital, conditional on baseline covariates, is positive. This too is plausible because all patients in the study have the same insurance and none of the 51 hospitals deny admission to AMI patients on the basis of any of the baseline covariates. Assumption 4 states that in each hospital, the potential mean outcome under treatment $a$ is independent of treatment assignment, conditional on baseline covariates. Assumption 5 states that the potential mean outcome is independent of hospital selection, conditional on baseline covariates. For a detailed discussion on similar assumptions for identification, see dahabreh2020extending.

We do not assume that $f_k(\mathbf{x})$, the probability density measure of $\mathbf{X} \mid R = k$, is the same across hospitals. Rather, we allow for heterogeneity in the distribution of covariates across hospitals, which can be modeled as a density ratio $\omega_k(\mathbf{x}) = f_j(\mathbf{x}) / f_k(\mathbf{x})$ between the target $j \in T$ and each source hospital $k \in S$. We show how to calculate the density ratio in a distributed manner, requiring only the target population of one or more hospitals to pass a vector of covariate means to the source hospitals.

3.3 Federated Global Estimator

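To adaptively combine estimators from each hospital, we propose the following global estimator for the TATE,

$$\hat{\Delta}_T = \hat{\mu}_T^{(1)} - \hat{\mu}_T^{(0)}, \qquad \hat{\mu}_T^{(a)} = \hat{\mu}_j^{(a)} + \sum_{k \in S} \eta_k \left( \hat{\mu}_k^{(a)} - \hat{\mu}_j^{(a)} \right), \tag{2}$$

where $\hat{\mu}_j^{(a)}$ is the estimated mean potential outcome in treatment group $A = a$ using data from the target hospital $j \in T$, $\hat{\mu}_k^{(a)}$ is the estimated mean potential outcome in $A = a$ incorporating data in source hospital $k \in S$, and $\eta_k$ are adaptive weights that satisfy $\sum_{k=1}^{K} \eta_k = 1$ with $\eta_k \geq 0$. In Section 3.4, we describe how to estimate treatment-specific mean potential outcomes for the target, $\hat{\mu}_j^{(a)}$, and source hospitals, $\hat{\mu}_k^{(a)}$, $k \in S$.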

It is worth noting that $\hat{\mu}_T^{(a)}$ in (2) is an estimate that leverages information from both the target and source hospitals. It can be interpreted as a linear combination of the estimators in each of the hospitals, where the relative weight assigned to each hospital is $\eta_k$. For example, in the case of a single target hospital $j$ and source hospital $k$, the global estimator can be written equivalently as

$$\hat{\mu}_T^{(a)} = \hat{\mu}_j^{(a)} + \eta_k \left( \hat{\mu}_k^{(a)} - \hat{\mu}_j^{(a)} \right) = (1 - \eta_k)\, \hat{\mu}_j^{(a)} + \eta_k\, \hat{\mu}_k^{(a)},$$

where $\eta_k \in [0, 1]$ determines the relative weight assigned to the target hospital and source hospital estimate. The left-hand equation makes it clear that we “anchor” on the estimator from the target hospital, $\hat{\mu}_j^{(a)}$, and augment it with a weighted difference between the target hospital estimator and the source hospital estimator. The right-hand equation shows the estimator re-written from the perspective of a linear combination of two estimators.
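The equivalence between the anchored form and the linear-combination form can be checked numerically. The following is a minimal sketch with made-up estimates and weights; the function names and all numeric inputs are hypothetical and serve only to illustrate the algebra, extended here to several source hospitals.

```python
def global_estimate_anchored(mu_target, mu_sources, etas):
    """Anchor on the target estimate and add weighted source-minus-target differences."""
    return mu_target + sum(eta * (mu_k - mu_target) for eta, mu_k in zip(etas, mu_sources))


def global_estimate_convex(mu_target, mu_sources, etas):
    """Same estimator written as a weighted average; the target gets the leftover weight."""
    target_weight = 1.0 - sum(etas)
    return target_weight * mu_target + sum(eta * mu_k for eta, mu_k in zip(etas, mu_sources))


# Hypothetical mean potential outcome estimates: one target and two source hospitals.
mu_target = 0.12
mu_sources = [0.10, 0.15]
etas = [0.3, 0.2]  # source weights; the target implicitly receives 1 - 0.5 = 0.5

anchored = global_estimate_anchored(mu_target, mu_sources, etas)
convex = global_estimate_convex(mu_target, mu_sources, etas)
print(anchored, convex)
```

Setting every source weight to zero returns the target-only estimate, which is the sense in which the estimator is anchored on the target hospital.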

Since it is likely that some source hospitals may present large discrepancies in estimating the mean potential outcome in the target hospital, $\eta_k$ should be estimated in a data-adaptive fashion, i.e., to downweight source hospitals that are markedly different. In Section 3.5, we describe a data-adaptive method to optimally combine the hospital estimators.

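Let $\pi_{a,k}(\mathbf{X}_i; \alpha_k)$ be a propensity score model for $P(A_i = a \mid \mathbf{X}_i, R_i = k)$ based on a parametric model with finite-dimensional parameter $\alpha_k$, which can be estimated locally in each hospital by fitting a parametric model, denoted $\pi_{a,k}(\mathbf{X}_i; \hat{\alpha}_k)$, where $\hat{\alpha}_k$ is a finite-dimensional parameter estimate. Let $m_{a,k}(\mathbf{X}_i; \beta_{a,k})$ be an outcome regression model for $\mathrm{E}(Y_i \mid \mathbf{X}_i, A_i = a, R_i = k)$, where $\beta_{a,k}$ is a finite-dimensional parameter that can also be estimated locally in each hospital by fitting a parametric model, denoted $m_{a,k}(\mathbf{X}_i; \hat{\beta}_{a,k})$, where $\hat{\beta}_{a,k}$ is a finite-dimensional parameter estimate.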

Figure 1 provides a flowchart of the estimation procedure. For ease of presentation, the target population is depicted as a single hospital.

Figure 1: Flowchart of the estimation procedure. The target site estimates its TATE with its own data, and shares its covariate means with source sites to enable them to calculate their own TATEs. A processing site then collects the estimates to determine the hospital weights and produce the global estimate for the target site.

First, the target hospital calculates its covariate mean vector, $\bar{\mathbf{X}}_j$, and transfers it to the source hospitals. In parallel, the target hospital estimates its outcome regression model $m_{a,j}(\mathbf{x}; \hat{\beta}_{a,j})$ and its propensity score model $\pi_{a,j}(\mathbf{x}; \hat{\alpha}_j)$ to calculate the TATE $\hat{\Delta}_j$ and likewise transfers it to the processing site. The source hospitals use $\bar{\mathbf{X}}_j$ obtained from the target hospital to estimate the density ratio parameter $\gamma_k$ by fitting an exponential tilt model and obtain their hospital-specific density ratio, $\omega_k(\mathbf{x}) = f_j(\mathbf{x}) / f_k(\mathbf{x})$, as $\omega(\mathbf{x}; \hat{\gamma}_k)$. In parallel, the source hospitals estimate their outcome regression models as $m_{a,k}(\mathbf{x}; \hat{\beta}_{a,k})$ and propensity score models as $\pi_{a,k}(\mathbf{x}; \hat{\alpha}_k)$ to compute $\hat{\Delta}_k$. These model estimates are then shared with the processing site. Finally, the processing site computes a tuning parameter $\lambda$, adaptive weights $\eta_k$, the global TATE $\hat{\Delta}_T$, and the 95% CI. The workflow is outlined in Algorithm 1, and the details are described in Sections 3.4 and 3.5.

Data: For $k = 1, \ldots, K$ hospitals, $D_k$
for target hospital $j \in T$ do
      Calculate $\bar{\mathbf{X}}_j$ and transfer to source hospitals. Estimate $\alpha_j$, $\pi_{a,j}$, $\beta_{a,j}$, and $m_{a,j}$. Calculate TATE as $\hat{\Delta}_j = \hat{\mu}_j^{(1)} - \hat{\mu}_j^{(0)}$ and transfer to processing site.
end for
for source hospitals $k \in S$ do
      Solve for $\gamma_k$, calculate $\omega_k$ and transfer to processing site. Estimate $\alpha_k$, $\pi_{a,k}$, $\beta_{a,k}$, $m_{a,k}$ and transfer to processing site.
end for
for processing site do
      Calculate the TATE estimator from each source hospital as $\hat{\Delta}_k = \hat{\mu}_k^{(1)} - \hat{\mu}_k^{(0)}$. Estimate $\eta_k$ by solving the penalized regression in (9). Construct the final global estimator as $\hat{\Delta}_T$ by (2) and variance by (10) and construct 95% CI.
end for
Result: Global TATE estimate, $\hat{\Delta}_T$, and 95% CI
Algorithm 1 Pseudocode to obtain global estimator leveraging all hospitals

3.4 Hospital-Specific Estimators

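When only the target hospital data are used to estimate the TATE, a doubly robust estimator for the mean potential outcome in the target hospital $j$ can take the form

$$\hat{\mu}_j^{(a)} = \frac{1}{n_j} \sum_{i=1}^{N} I(R_i = j) \left[ m_{a,j}(\mathbf{X}_i; \hat{\beta}_{a,j}) + \frac{I(A_i = a)}{\pi_{a,j}(\mathbf{X}_i; \hat{\alpha}_j)} \left\{ Y_i - m_{a,j}(\mathbf{X}_i; \hat{\beta}_{a,j}) \right\} \right], \tag{3}$$

which has the standard augmented inverse probability weighted (AIPW) form (robins1994estimation). If the target population includes multiple target hospitals, then a weighted average of the hospital-specific estimators $\hat{\mu}_j^{(a)}$, $j \in T$, can be used.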
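The AIPW form can be made concrete with a short sketch. It assumes the nuisance models have already been fitted locally, so the inputs are per-patient fitted values: `m_hat` is the predicted outcome under the arm of interest and `pi_hat` is the estimated probability of receiving that arm. The function name and the toy data are hypothetical.

```python
def aipw_mean(y, a, m_hat, pi_hat, arm):
    """AIPW estimate of the mean potential outcome under treatment `arm`.

    y      : observed outcomes
    a      : observed treatment indicators (0 or 1)
    m_hat  : fitted outcome-regression predictions under `arm`, one per patient
    pi_hat : fitted probabilities of receiving `arm`, one per patient
    """
    total = 0.0
    for y_i, a_i, m_i, pi_i in zip(y, a, m_hat, pi_hat):
        received_arm = 1.0 if a_i == arm else 0.0
        # Outcome-model prediction plus an inverse-probability-weighted residual.
        total += m_i + received_arm / pi_i * (y_i - m_i)
    return total / len(y)


# Toy data: four patients, two treated (a = 1) and two controls (a = 0).
y = [1, 0, 1, 0]
a = [1, 1, 0, 0]
m_hat = [0.5, 0.5, 0.5, 0.5]
pi_hat = [0.5, 0.5, 0.5, 0.5]
mu_1 = aipw_mean(y, a, m_hat, pi_hat, arm=1)
print(mu_1)
```

When the outcome predictions equal the observed outcomes for patients in the arm, the residual term vanishes and the estimate reduces to the average prediction, whatever the propensity values are; this is one side of the double robustness property.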

To estimate $\mu_T^{(a)}$ using source data, the covariate shifts between the target and source hospitals need to be accounted for in order to avoid bias in the estimator. In other words, the data from source hospitals should be used in a fashion that emphasizes source hospitals that are more similar to the target hospital. We estimate the density ratio, $\omega_k(\mathbf{x}) = f_j(\mathbf{x}) / f_k(\mathbf{x})$, where $f_j(\mathbf{x})$ denotes the density function of $\mathbf{X}$ in the target hospital $j$ and $f_k(\mathbf{x})$ is the density function of $\mathbf{X}$ in source hospital $k$. To estimate $\omega_k(\mathbf{x})$, we propose a semiparametric model between the target hospital $j$ and each source hospital $k \in S$, such that

$$f_j(\mathbf{x}) = f_k(\mathbf{x})\, \omega(\mathbf{x}; \gamma_k), \tag{4}$$

where the function $\omega$ satisfies $\omega(\mathbf{x}; \mathbf{0}) = 1$ and $\int f_k(\mathbf{x})\, \omega(\mathbf{x}; \gamma_k)\, d\mathbf{x} = 1$. We choose $\omega(\mathbf{x}; \gamma_k) = \exp\{-\gamma_k^\top \psi(\mathbf{x})\}$, where $\psi(\cdot)$ is some $d$-dimensional basis with $1$ as its first element. This is known as the exponential tilt density ratio model (qin1998inferences). Choosing $\psi(\mathbf{x}) = \mathbf{x}$ recovers the entire class of natural exponential family distributions. By including higher order terms, the exponential tilt model can also account for differences in dispersion and correlations, giving it great flexibility in characterizing the heterogeneity between two populations (duan20201fast).

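In a federated setting where individual-level data cannot be shared across hospitals, from (4) we observe that

$$\mathrm{E}_T\{\psi(\mathbf{X})\} = \int \psi(\mathbf{x}) f_j(\mathbf{x})\, d\mathbf{x} = \int \psi(\mathbf{x}) f_k(\mathbf{x})\, \omega(\mathbf{x}; \gamma_k)\, d\mathbf{x} = \mathrm{E}\{\psi(\mathbf{X})\, \omega(\mathbf{X}; \gamma_k) \mid R = k\}. \tag{5}$$

Thus, we propose to estimate $\gamma_k$ by solving the following estimating equations:

$$\frac{1}{n_j} \sum_{i=1}^{N} I(R_i = j)\, \psi(\mathbf{X}_i) = \frac{1}{n_k} \sum_{i=1}^{N} I(R_i = k)\, \psi(\mathbf{X}_i)\, \omega(\mathbf{X}_i; \gamma_k). \tag{6}$$

Specifically, we ask each target hospital to broadcast its $d$-vector of covariate means of $\psi(\mathbf{X}_i)$ to the source hospitals. Each source hospital $k \in S$ then solves the above estimating equations using only its individual-level data. Finally, the density ratio weight $\omega_k(\mathbf{x}) = f_j(\mathbf{x}) / f_k(\mathbf{x})$ is estimated as $\omega(\mathbf{x}; \hat{\gamma}_k)$. With the estimated density ratio weights $\omega(\mathbf{x}; \hat{\gamma}_k)$, the processing site can calculate

$$\hat{\mu}_k^{(a)} = \frac{1}{n_j} \sum_{i=1}^{N} I(R_i = j)\, m_{a,j}(\mathbf{X}_i; \hat{\beta}_{a,j}) + \frac{1}{n_k} \sum_{i=1}^{N} I(R_i = k)\, \frac{I(A_i = a)\, \omega(\mathbf{X}_i; \hat{\gamma}_k)}{\pi_{a,k}(\mathbf{X}_i; \hat{\alpha}_k)} \left\{ Y_i - m_{a,k}(\mathbf{X}_i; \hat{\beta}_{a,k}) \right\}. \tag{7}$$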

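To obtain $\hat{\mu}_k^{(a)}$ in practice, a processing site aggregates the estimates from the target and source hospitals. The processing site can be any one of the hospitals or another entity entirely, such as a central agency or organization to which the target and source hospitals belong. The target hospital $j$ calculates the conditional mean $n_j^{-1} \sum_{i=1}^{N} I(R_i = j)\, m_{a,j}(\mathbf{X}_i; \hat{\beta}_{a,j})$ and passes it to the processing site, and source hospital $k$ calculates the augmentation term $n_k^{-1} \sum_{i=1}^{N} I(R_i = k)\, I(A_i = a)\, \omega(\mathbf{X}_i; \hat{\gamma}_k)\, \pi_{a,k}(\mathbf{X}_i; \hat{\alpha}_k)^{-1} \{ Y_i - m_{a,k}(\mathbf{X}_i; \hat{\beta}_{a,k}) \}$ and passes it to the processing site.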
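The density ratio step can be sketched for the simplest case of a single scalar covariate with basis $\psi(x) = (1, x)$. The sketch assumes weights of the form $\exp(-\gamma_0 - \gamma_1 x)$ and solves the moment-matching equations by Newton's method; the function name, the solver, and the toy data are illustrative assumptions rather than the authors' implementation. The source hospital needs only the target covariate mean, not target patient records.

```python
import math


def fit_exponential_tilt(x_source, target_mean, n_iter=50):
    """Solve for (gamma0, gamma1) so that the weights exp(-gamma0 - gamma1 * x)
    average to one and reweight the source covariate mean to `target_mean`."""
    gamma1 = 0.0
    for _ in range(n_iter):
        logits = [-gamma1 * x for x in x_source]
        top = max(logits)  # subtract the maximum for numerical stability
        w = [math.exp(v - top) for v in logits]
        total = sum(w)
        mean_w = sum(wi * x for wi, x in zip(w, x_source)) / total
        var_w = sum(wi * (x - mean_w) ** 2 for wi, x in zip(w, x_source)) / total
        gap = mean_w - target_mean
        if abs(gap) < 1e-12 or var_w == 0.0:
            break
        # The weighted mean has derivative -var_w in gamma1, so Newton adds gap / var_w.
        gamma1 += gap / var_w
    # Choose the intercept so that the weights average to exactly one.
    logits = [-gamma1 * x for x in x_source]
    top = max(logits)
    gamma0 = top + math.log(sum(math.exp(v - top) for v in logits) / len(x_source))
    weights = [math.exp(-gamma0 - gamma1 * x) for x in x_source]
    return gamma0, gamma1, weights


# Hypothetical source covariate values and a target mean received from the target hospital.
x_source = [0.0, 1.0, 2.0, 3.0, 4.0]
gamma0, gamma1, weights = fit_exponential_tilt(x_source, target_mean=2.5)
reweighted_mean = sum(w * x for w, x in zip(weights, x_source)) / len(x_source)
print(gamma1, reweighted_mean)
```

The target mean must lie strictly inside the range of the source covariate values, otherwise no finite tilt parameter can match it. This mirrors the positivity requirement in Assumption 3.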

3.5 Optimal Combination and Inference

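We now describe how the processing site can estimate the adaptive weights $\eta_k$ such that it optimally combines estimates of the mean potential outcome in the target hospital and source hospitals for efficiency gain when the source hospital estimates are sufficiently similar to the target estimate, and shrinks the weight of unacceptably different source hospital estimates toward $0$ so as to discard them. In order to safely leverage information from source hospitals, we anchor on the estimator from the target hospital, $\hat{\mu}_j^{(a)}$. When $\hat{\mu}_k^{(a)}$ is similar to $\hat{\mu}_j^{(a)}$, we would seek to estimate $\eta_k$ to minimize their variance. But if $\hat{\mu}_k^{(a)}$ for any $k$ is too different from $\hat{\mu}_j^{(a)}$, a precision-weighted estimator would inherit this discrepancy. Examining the mean squared error (MSE) of the data-adaptive global estimator relative to the limiting estimand of the target-hospital estimator shows that the MSE can be decomposed into a variance term that can be minimized by a least squares regression of influence functions from an asymptotic linear expansion of $\hat{\mu}_j^{(a)}$ and $\hat{\mu}_k^{(a)}$, and an asymptotic bias term of $\hat{\mu}_k^{(a)}$ for estimating the limiting estimand of $\hat{\mu}_j^{(a)}$. More formally, writing $\bar{\mu}_j^{(a)}$ and $\bar{\mu}_k^{(a)}$ for the limiting values of $\hat{\mu}_j^{(a)}$ and $\hat{\mu}_k^{(a)}$, define

$$\sqrt{N} \left( \hat{\mu}_j^{(a)} - \bar{\mu}_j^{(a)} \right) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \xi_{i,T}^{(a)} + o_p(1), \qquad \sqrt{N} \left( \hat{\mu}_k^{(a)} - \bar{\mu}_k^{(a)} \right) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} \xi_{i,k}^{(a)} + o_p(1), \tag{8}$$

where $\xi_{i,T}^{(a)}$ is the influence function for the target hospital and $\xi_{i,k}^{(a)}$ is the influence function for source hospital $k$.

To estimate $\eta_k$, we minimize a weighted $\ell_1$ penalty function,

$$\hat{Q}_a(\eta) = \sum_{i=1}^{N} \left[ \hat{\xi}_{i,T}^{(a)} - \sum_{k \in S} \eta_k \left( \hat{\xi}_{i,T}^{(a)} - \hat{\xi}_{i,k}^{(a)} - \hat{\delta}_k \right) \right]^2 + \lambda \sum_{k \in S} |\eta_k|\, \hat{\delta}_k^2, \tag{9}$$

where $\sum_{k=1}^{K} \eta_k = 1$ with $\eta_k \geq 0$, $\hat{\delta}_k = \hat{\mu}_k^{(a)} - \hat{\mu}_j^{(a)}$ is the estimated discrepancy from source hospital $k$, and $\lambda$ is a tuning parameter that determines the level of penalty for a source hospital estimate. We call this estimator GLOBAL-$\ell_1$. Extending results from han2021federated, we prove that, given a suitable choice for $\lambda$, the minimizers $\hat{\eta}_k$ are adaptive weights such that source hospitals whose estimates agree with the target in the limit are retained and source hospitals with a non-negligible discrepancy have their weights shrunk to zero (Appendix I). Importantly, we also show that we can solve for the $\eta_k$ that minimizes the $\hat{Q}_a(\eta)$ function without sharing patient-level information from the influence functions (Appendix II).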
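The data-adaptive behavior of the weights can be illustrated in the special case of one source hospital, where a penalized least squares problem of this kind has a closed-form solution. The sketch below assumes an objective of the form $\sum_i [\xi_{i,T} - \eta(\xi_{i,T} - \xi_{i,k} - \delta)]^2 + \lambda |\eta| \delta^2$ with $\eta \in [0, 1]$. This form, the function name, and the toy influence-function values are assumptions made for illustration, and the multi-hospital problem would need a constrained solver instead.

```python
def single_source_weight(xi_target, xi_source, delta, lam):
    """Closed-form minimizer over eta in [0, 1] of
    sum_i (xi_target[i] - eta * d[i])**2 + lam * eta * delta**2,
    where d[i] = xi_target[i] - xi_source[i] - delta."""
    d = [t - s - delta for t, s in zip(xi_target, xi_source)]
    denom = sum(v * v for v in d)
    if denom == 0.0:
        return 0.0
    # Setting the derivative in eta to zero gives the unconstrained solution.
    eta = (sum(t * v for t, v in zip(xi_target, d)) - lam * delta ** 2 / 2.0) / denom
    # The objective is convex in eta, so clipping gives the constrained minimizer.
    return min(1.0, max(0.0, eta))


# Hypothetical centered influence-function values for four patients.
xi_target = [1.0, -1.0, 1.0, -1.0]
xi_source = [0.5, -0.5, 0.5, -0.5]

eta_similar = single_source_weight(xi_target, xi_source, delta=0.0, lam=10.0)
eta_different = single_source_weight(xi_target, xi_source, delta=1.0, lam=100.0)
print(eta_similar, eta_different)
```

With no discrepancy the penalty vanishes and the less variable source estimate receives full weight, while a large discrepancy combined with a large tuning parameter drives the source weight to zero.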

The GLOBAL-$\ell_1$ estimator may be preferable in ‘sparse’ settings where many source hospitals differ from the target hospital TATE. Furthermore, the GLOBAL-$\ell_1$ estimator has a practical advantage of ‘selecting’ peer hospitals for comparison. For example, in multi-year studies, all hospitals could be used in year one, and only those hospitals with non-zero weights can be used in ensuing years. This could result in substantial resource savings in cost and time. In settings where the TATE estimates from source hospitals have relatively low discrepancy compared to the target hospital, we would not wish to shrink the weight of any hospital to zero. In this case, hospital-level weights can minimize a penalty function where $\eta_k^2$ replaces $|\eta_k|$ in the penalty term of (9). We call this estimator GLOBAL-$\ell_2$.

We propose sample splitting for the optimal selection of $\lambda$. Specifically, we split the data into training and validation datasets across all hospitals. In the training dataset, we estimate our nuisance parameters $\alpha_k$, $\beta_{a,k}$, and $\gamma_k$ and influence functions, and solve the penalized regression distributively for a grid of $\lambda$ values. Using the associated weights from each value of $\lambda$, we estimate the MSE in the validation data. We set the value of the optimal tuning parameter, $\lambda_{\mathrm{opt}}$, to be the value that minimizes the MSE in the validation data.

We propose estimating SEs for $\hat{\mu}_T^{(a)}$ using the influence functions for $\hat{\mu}_j^{(a)}$ and $\hat{\mu}_k^{(a)}$. By the central limit theorem,

where and is the limiting value of . We estimate the SE of as , where


Given the SE estimate for the global estimator, pointwise confidence intervals (CIs) can be constructed based on the normal approximation.
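Once a point estimate and its SE are available, the normal-approximation interval is a one-line computation. The sketch below uses only the Python standard library; the function name and the example estimate and SE are hypothetical.

```python
from statistics import NormalDist


def normal_ci(estimate, se, level=0.95):
    """Pointwise confidence interval based on the normal approximation."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 when level = 0.95
    return estimate - z * se, estimate + z * se


# Hypothetical global TATE estimate and standard error.
lower, upper = normal_ci(estimate=-0.05, se=0.02)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```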

4 Simulation Study

We evaluate the finite-sample performance of our proposed federated global estimators and compare them to an estimator that leverages target hospital data only and two sample size-adjusted estimators that use all hospitals, but do not adaptively weight them. We examine the empirical absolute bias, root mean square error (RMSE), coverage of the CI, and length of the CI for alternative data generating mechanisms and various numbers of source hospitals, running simulations for each setting.

4.1 Data Generating Process

We examine two generating mechanisms: the dense and sparse data settings, where the sparse setting means that fewer source hospitals have the same covariate distribution as the target hospital, and the proportion of such source hospitals declines as the number of source hospitals increases.

To simulate heterogeneity in the covariate distributions across hospitals, we consider skewed normal distributions with varying levels of skewness for each hospital. Specifically, the covariates $X_{kq}$ are generated from a skewed normal distribution $\mathcal{SN}(x; \Xi_{kq}, \Omega_{kq}^2, \mathrm{A}_{kq})$, where $k = 1, \ldots, K$ indexes the hospitals and $q = 1, \ldots, p$ indexes the covariates. $\Xi_{kq}$ is the location parameter, $\Omega_{kq}$ is the scale parameter, and $\mathrm{A}_{kq}$ is the skewness parameter. The distribution follows the density function $f(x) = \frac{2}{\Omega_{kq}}\, \phi\left( \frac{x - \Xi_{kq}}{\Omega_{kq}} \right) \Phi\left( \mathrm{A}_{kq}\, \frac{x - \Xi_{kq}}{\Omega_{kq}} \right)$, where $\phi(\cdot)$ is the standard normal probability density function and $\Phi(\cdot)$ is the standard normal cumulative distribution function. Using these distributions, we examine and compare the dense and sparse settings. In this section, we examine in detail the lower-dimensional setting with continuous covariates only. To study the likely impact of increasing $p$ and to show that the algorithm accommodates both continuous and binary covariates, we also consider $p = 10$ with two continuous and eight binary covariates. The covariate distribution of the target hospital is the same in each setting.
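Covariates with a skewed normal distribution can be simulated with the standard library alone. The sketch below uses the usual stochastic representation of the skew normal (a weighted sum of a half-normal and an independent normal variate) and the textbook density with location, scale, and skewness parameters; the function names and parameter values are hypothetical and are not the values used in the simulation study.

```python
import math
import random


def skew_normal_pdf(x, location, scale, skewness):
    """Skew normal density: (2 / scale) * phi(z) * Phi(skewness * z), z = (x - location) / scale."""
    z = (x - location) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(skewness * z / math.sqrt(2.0)))
    return 2.0 / scale * phi * big_phi


def sample_skew_normal(location, scale, skewness, rng):
    """Draw one skew normal variate via delta * |U0| + sqrt(1 - delta^2) * U1."""
    delta = skewness / math.sqrt(1.0 + skewness ** 2)
    u0 = abs(rng.gauss(0.0, 1.0))
    u1 = rng.gauss(0.0, 1.0)
    return location + scale * (delta * u0 + math.sqrt(1.0 - delta ** 2) * u1)


rng = random.Random(2021)
# Hypothetical hospital with a right-skewed covariate.
draws = [sample_skew_normal(0.0, 1.0, 3.0, rng) for _ in range(20000)]
sample_mean = sum(draws) / len(draws)
# Theoretical mean: location + scale * delta * sqrt(2 / pi).
theoretical_mean = (3.0 / math.sqrt(10.0)) * math.sqrt(2.0 / math.pi)
print(sample_mean, theoretical_mean)
```

Setting the skewness parameter to zero recovers the ordinary normal distribution, so a hospital-specific skewness is one simple way to make case mixes differ across simulated hospitals.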

For the target hospital , we specify its sample size as patients. We assign sample sizes to each source hospital using the distribution

specifying that the gamma distribution have a mean of

, a standard deviation of

, and a minimum volume threshold of patients.
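Source hospital sample sizes with a given mean and standard deviation can be drawn from a gamma distribution by converting the two moments into shape and scale parameters. In the sketch below the mean, standard deviation, and minimum volume are placeholder values, and raising draws below the minimum up to the threshold is an assumption about how the floor is enforced; neither is taken from the original study.

```python
import random


def gamma_shape_scale(mean, sd):
    """Convert a mean and standard deviation into gamma shape and scale parameters."""
    shape = (mean / sd) ** 2
    scale = sd ** 2 / mean
    return shape, scale


def draw_source_sizes(n_sources, mean, sd, minimum, rng):
    """Draw integer hospital sample sizes from a gamma distribution with a volume floor."""
    shape, scale = gamma_shape_scale(mean, sd)
    # random.gammavariate takes (shape, scale), so its mean is shape * scale.
    return [max(minimum, round(rng.gammavariate(shape, scale))) for _ in range(n_sources)]


rng = random.Random(7)
# Placeholder settings: 10 source hospitals, mean 200, SD 50, at least 30 patients each.
sizes = draw_source_sizes(n_sources=10, mean=200.0, sd=50.0, minimum=30, rng=rng)
print(sizes)
```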

4.2 Simulation Settings

The true potential outcomes are generated as

where denotes squared element-wise, is a -vector of equal increments, , , and

where is a -vector of equal increments, and .

The true propensity score model is generated as

for both the target and source hospitals, with , .

We consider five different model specification settings. In Setting I with , we study the scenario where both the outcome model and propensity score model are correctly specified. Setting II differs in that we have a -vector of equal increments , so that the true and include quadratic terms, which we misspecify by fitting a linear outcome model. Setting III differs from Setting I in that we have a -vector of equal decrements , which we misspecify by fitting a logistic linear propensity score model. Setting IV includes both -vectors of equal increments and -vectors of equal decrements , which we misspecify by fitting a linear outcome regression model and a logistic linear propensity score model, respectively. Finally, Setting V has , for the target hospital and half the source hospitals , but for the remaining source hospitals, thereby misspecifying the outcome model in all hospitals and the propensity score model in half the source hospitals. For , the five settings are generated similarly. Details on the generating mechanisms are provided in Appendix III.

In the simulations, we consider all ten combinations of the model specifications and covariate density setting with total hospitals, and five estimators: 1) an estimator using data from the target hospital only (Target-Only), 2) an estimator using all hospitals that weights each hospital proportionally to its sample size and assumes homogeneous covariate distributions across hospitals by fixing the density ratio to be for all hospitals (SS naive), 3) an estimator using all hospitals that weights each hospital proportionally to its sample size but correctly specifies the density ratio weights (SS), 4) the GLOBAL- federated estimator, and 5) the GLOBAL- federated estimator.

We choose the tuning parameter from among as follows. In ten folds, we split the simulated datasets into two equal-sized samples, with each containing all hospitals, using one sample for training and the other for validation. The function is evaluated as the average across those ten splits.
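The repeated sample-splitting procedure can be sketched as follows. Each hospital's records are split in half so that both the training and validation samples contain every hospital, and the validation loss is averaged over ten splits for each candidate tuning value. The `toy_fit` and `toy_loss` functions, the candidate grid, and the simulated data are hypothetical stand-ins; in the paper the quantity being averaged is the penalized objective, which is not reproduced here.

```python
import random


def split_within_hospitals(data, rng):
    """Split each hospital's records in half so both halves keep all hospitals."""
    train, valid = {}, {}
    for hospital, records in data.items():
        shuffled = records[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train[hospital] = shuffled[:half]
        valid[hospital] = shuffled[half:]
    return train, valid


def select_lambda(data, lambdas, fit, loss, n_splits=10, seed=0):
    """Pick the tuning value with the smallest average validation loss."""
    rng = random.Random(seed)
    # Reuse the same splits for every candidate so comparisons are paired.
    splits = [split_within_hospitals(data, rng) for _ in range(n_splits)]
    avg_loss = {}
    for lam in lambdas:
        losses = [loss(fit(train, lam), valid) for train, valid in splits]
        avg_loss[lam] = sum(losses) / n_splits
    best = min(avg_loss, key=avg_loss.get)
    return best, avg_loss


def toy_fit(train, lam):
    """Hypothetical estimator: pooled mean shrunk toward zero by lam."""
    values = [y for records in train.values() for y in records]
    return sum(values) / (len(values) + lam)


def toy_loss(estimate, valid):
    """Hypothetical loss: mean squared error on the validation half."""
    values = [y for records in valid.values() for y in records]
    return sum((y - estimate) ** 2 for y in values) / len(values)


# Hypothetical data: three hospitals with 40 outcomes each.
data_rng = random.Random(1)
data = {h: [data_rng.gauss(5.0, 1.0) for _ in range(40)] for h in ("A", "B", "C")}
best_lam, avg_loss = select_lambda(data, [0.0, 1.0, 10.0, 100.0], toy_fit, toy_loss)
```

With an odd number of records in a hospital, the two halves differ in size by one; the sketch assigns the extra record to the validation half.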

4.3 Simulation Results

Throughout this section, we describe in detail the results for model specifications I–V with in the setting. We address the setting and extensions to covariates when appropriate, and report detailed numerical results for these alternative settings in the Appendix.

Table 1 reports results for with across simulations. In this setup, the GLOBAL- estimator produces sparser weights and has substantially lower RMSE than the Target-Only estimator in every setting where at least one hospital has a correctly specified model (Settings I, II, III, and V). The GLOBAL- estimator produces estimates with larger biases, but also with the lowest RMSE, with the RMSE advantage increasing with . Relative to the global estimators, the SS (naive) estimator demonstrates very large biases, while the SS estimator that incorporates the density ratio weights has less bias and lower RMSE, but has longer CIs. A notable difference in the alternative setting is that the GLOBAL- estimator produces more uniform weights and has better performance (see Appendix Table 3). In Appendix III, the distribution of patient-level observations is visualized for the , case.

Simulation scenarios
                  Bias   RMSE   Cov.   Len. |   Bias   RMSE   Cov.   Len. |   Bias   RMSE   Cov.   Len.
Setting I
   Target-Only    0.00   0.69  98.20   3.10 |   0.00   0.69  98.10   3.10 |   0.00   0.69  98.10   3.10
   SS (naive)     0.87   0.88  24.90   1.41 |   0.90   0.91   1.10   0.84 |   0.91   0.92   0.00   0.45
   SS             0.01   0.40  99.60   2.40 |   0.01   0.29  99.20   1.58 |   0.00   0.21  96.50   0.91
   GLOBAL-        0.16   0.34  98.80   1.78 |   0.17   0.29  96.30   1.24 |   0.17   0.24  84.40   0.72
   GLOBAL-        0.05   0.50  97.30   2.04 |   0.05   0.46  96.20   1.69 |   0.04   0.49  88.60   1.45
Setting II
   Target-Only    0.00   0.72  97.60   3.18 |   0.00   0.72  97.50   3.18 |   0.00   0.72  97.50   3.18
   SS (naive)     0.87   0.89  26.40   1.44 |   0.90   0.91   2.10   0.86 |   0.91   0.92   0.00   0.46
   SS             0.01   0.40  99.50   2.44 |   0.01   0.29  99.20   1.60 |   0.00   0.21  96.30   0.92
   GLOBAL-        0.18   0.35  98.60   1.86 |   0.18   0.30  96.00   1.28 |   0.19   0.25  84.30   0.74
   GLOBAL-        0.06   0.52  97.80   2.08 |   0.06   0.48  95.50   1.73 |   0.05   0.50  87.70   1.46
Setting III
   Target-Only    0.04   0.70  96.60   3.09 |   0.04   0.71  96.50   3.09 |   0.05   0.71  96.50   3.09
   SS (naive)     0.84   0.86  28.50   1.46 |   0.87   0.88   2.00   0.87 |   0.89   0.89   0.00   0.46
   SS             0.02   0.42  99.90   2.56 |   0.01   0.31  99.20   1.69 |   0.03   0.22  97.40   0.96
   GLOBAL-        0.12   0.33  99.00   1.83 |   0.13   0.28  97.50   1.27 |   0.14   0.22  88.30   0.73
   GLOBAL-        0.01   0.50  97.20   2.06 |   0.02   0.47  95.20   1.72 |   0.01   0.50  88.20   1.45
Setting IV
   Target-Only    0.10   0.74  96.80   3.18 |   0.10   0.74  96.80   3.18 |   0.10   0.74  96.80   3.18
   SS (naive)     0.82   0.83  34.70   1.53 |   0.85   0.86   3.30   0.91 |   0.86   0.87   0.00   0.48
   SS             0.05   0.43  99.80   2.60 |   0.04   0.31  99.20   1.71 |   0.06   0.23  97.20   0.98
   GLOBAL-        0.10   0.33  99.00   1.92 |   0.11   0.27  97.70   1.32 |   0.12   0.21  91.30   0.76
   GLOBAL-        0.04   0.54  96.80   2.13 |   0.02   0.49  95.10   1.76 |   0.03   0.52  85.30   1.46
Setting V
   Target-Only    0.00   0.72  97.60   3.18 |   0.00   0.72  97.50   3.18 |   0.00   0.72  97.50   3.18
   SS (naive)     0.85   0.86  27.90   1.44 |   0.87   0.88   2.40   0.87 |   0.89   0.89   0.00   0.46
   SS             0.02   0.43  99.70   2.59 |   0.02   0.31  99.30   1.71 |   0.03   0.22  96.90   0.98
   GLOBAL-        0.17   0.35  98.90   1.87 |   0.17   0.29  96.30   1.29 |   0.18   0.24  86.30   0.75
   GLOBAL-        0.05   0.52  97.60   2.10 |   0.05   0.48  96.30   1.74 |   0.06   0.51  87.50   1.47

Abbreviations: RMSE = root mean square error; Cov. = coverage; Len. = length of CI; SS = sample size.

Table 1: Results from 1000 simulated datasets for covariate distribution when with varying simulation settings and numbers of source sites.
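For readers who wish to reproduce summaries of this kind, the four reported metrics can be computed from replicated point estimates and standard errors roughly as sketched below. The function name `summarize`, the use of absolute bias, and the Wald-type 95% interval (estimate plus or minus 1.96 standard errors) are assumptions for illustration; the paper's interval construction may differ.

```python
import math


def summarize(estimates, ses, truth, z=1.96):
    """Summarize replicated estimates as bias, RMSE, coverage (%), and CI length.

    Assumes Wald-type intervals; absolute bias is reported by assumption.
    """
    n = len(estimates)
    bias = abs(sum(estimates) / n - truth)
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    covered = 0
    total_length = 0.0
    for est, se in zip(estimates, ses):
        lower, upper = est - z * se, est + z * se
        covered += lower <= truth <= upper
        total_length += upper - lower
    return {
        "bias": bias,
        "rmse": rmse,
        "coverage": 100.0 * covered / n,
        "length": total_length / n,
    }


# Hypothetical replicates: three estimates of a true effect equal to 1.0.
result = summarize([0.9, 1.1, 1.3], [0.1, 0.1, 0.1], truth=1.0)
```

Reporting coverage alongside interval length makes the trade-off visible: an estimator can shorten its intervals only so far before coverage falls below the nominal level.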

To highlight the difference in weights obtained from the different methods, we plot the weights of the GLOBAL-, GLOBAL-, and SS estimators as a function of the distance to the target hospital TATE. Figure 2 (left panel) illustrates the weights for when hospitals in the setting where . The GLOBAL- estimator places about half the weight on the target hospital and drops three hospitals entirely that have large discrepancy compared to the target hospital TATE. The SS estimator has close to uniform weights. The GLOBAL- estimator produces weights between the GLOBAL- estimator and the SS estimator. In terms of covariate imbalance, we calculated the absolute distance (where the true scale parameter is 1) between the target hospital covariate means and the weighted source hospital covariate means using the GLOBAL-, GLOBAL-, and SS estimators. Figure 2 (right panel) shows that the GLOBAL- estimator has the smallest covariate imbalances across both covariates. To illustrate these covariate imbalances, the dataset is generated 10 times with different seeds.
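The covariate imbalance calculation described above can be sketched as the absolute difference, covariate by covariate, between the target hospital's means and a weighted average of the source hospitals' means. The function name `covariate_imbalance`, the renormalization of the source weights to sum to one, and the numbers in the usage lines are assumptions for illustration only.

```python
def covariate_imbalance(target_means, source_means, weights):
    """Absolute gap between target means and weighted source means.

    target_means: list of p covariate means for the target hospital.
    source_means: list of K lists, each holding p covariate means.
    weights: list of K non-negative source weights (renormalized here
             by assumption so that they sum to one).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one source weight must be positive")
    normalized = [w / total for w in weights]
    gaps = []
    for j, target_value in enumerate(target_means):
        pooled = sum(w * means[j] for w, means in zip(normalized, source_means))
        gaps.append(abs(target_value - pooled))
    return gaps


# Hypothetical example: two covariates, two source hospitals.
target = [0.0, 0.0]
sources = [[0.1, -0.1], [1.0, 1.0]]
adaptive = covariate_imbalance(target, sources, [1.0, 0.0])  # drops the distant site
uniform = covariate_imbalance(target, sources, [0.5, 0.5])   # equal weights
```

In this toy example the weights that drop the dissimilar hospital yield a smaller imbalance on both covariates than equal weights do, which is the pattern the right panel of Figure 2 is meant to convey.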

Figure 2: At left, we plot the distance from each source hospital to the target hospital TATE vs. hospital weights in the setting, showing how the global estimator upweights hospitals with similar TATE estimates. At right, we observe smaller covariate imbalances between the target and source hospitals when using the global estimators.

The weights for the setting are given in Appendix III. In the setting, the differences in covariate imbalances between the GLOBAL- and GLOBAL- estimators are less pronounced, and the GLOBAL- estimator is preferred over the GLOBAL- estimator due to its smaller RMSE.

5 Performance of CCE

We used our federated causal inference framework to evaluate the performance of candidate CCE. Despite this common designation, there was substantial variation in the distribution of baseline covariates across hospitals (Figure 3). For example, the proportion of patients with renal failure ranged from one-fifth to one-half across hospitals. For hospitals, this means that the shared designation does not guarantee that a fellow CCE is an appropriate comparator, because case-mix can differ substantially.

Figure 3: Variation in hospital case-mix across candidate CCE.

Despite these differences, a typical sample-size-based approach places more emphasis on source hospitals with greater volume, regardless of their appropriateness as comparators. In contrast, the global estimator benchmarks each target hospital in a way that places more emphasis on source hospitals similar to it. To illustrate how the global estimators differ from existing methods in weighting toward more relevant hospitals, we first show the difference in weights obtained from GLOBAL-, GLOBAL-, and SS, plotting the absolute discrepancy obtained from each source hospital TATE estimate against the weights (Figure 4). As examples, we selected three diverse hospitals to take turns serving as the target hospital. Hospital A is an urban major academic medical center with extensive cardiac technology, Hospital B is urban and for-profit, and Hospital C is rural and non-teaching.

Figure 4 shows that the SS weights are close to uniform, ranging from to . Indeed, regardless of which hospital serves as the target hospital, the SS weights are the same in each case, showing an inability to adapt to the specified target hospital. Thus, despite potential systemic differences in the types of patients served by these three very different hospitals, SS-based estimators are indifferent to this variation.

In contrast, the GLOBAL- estimator places more weight on hospitals that are closer to the target hospital TATE and ‘drops’ hospitals once a threshold discrepancy is crossed. In so doing, the GLOBAL- estimator makes a data-adaptive bias-variance trade-off, reducing variance and increasing the effective sample size at the cost of introducing slight discrepancy in estimates. Therefore each of these three different hospitals not only benefits from a gain in estimation precision of its own performance, but is also reassured that the source hospitals providing that precision gain were more relevant bases for comparison.
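To make the dropping behavior concrete, the following toy sketch minimizes a stylized objective: the variance of a weighted combination of independent hospital estimates plus a penalty that grows with each source hospital's squared discrepancy from the target. This is not the paper's penalized regression on influence-function summaries; the objective, the independence assumption, the projected gradient solver, the function name `adaptive_weights`, and all numeric inputs are assumptions chosen only to show how a sparsity-inducing penalty can set some weights exactly to zero.

```python
def adaptive_weights(var_target, var_sources, discrepancies, lam, n_iter=2000):
    """Toy weights minimizing variance plus a discrepancy-scaled penalty.

    Objective (an assumption, not the paper's):
        (1 - sum(eta))**2 * var_target
        + sum(eta_k**2 * var_k)
        + lam * sum(eta_k * delta_k**2),   with eta_k >= 0.
    Returns (target_weight, source_weights).
    """
    k = len(var_sources)
    eta = [0.0] * k
    # Step size from an upper bound on the largest Hessian eigenvalue.
    step = 1.0 / (2.0 * var_target * k + 2.0 * max(var_sources))
    for _ in range(n_iter):
        remaining = 1.0 - sum(eta)
        grad = [
            -2.0 * remaining * var_target
            + 2.0 * eta[j] * var_sources[j]
            + lam * discrepancies[j] ** 2
            for j in range(k)
        ]
        # Projected gradient step: weights are clipped at zero, which is
        # what removes hospitals with large discrepancies.
        eta = [max(0.0, eta[j] - step * grad[j]) for j in range(k)]
    return 1.0 - sum(eta), eta


# Hypothetical inputs: two similar source hospitals and one distant one.
target_weight, source_weights = adaptive_weights(
    var_target=0.5,
    var_sources=[0.5, 0.5, 0.5],
    discrepancies=[0.05, 0.10, 2.0],
    lam=1.0,
)
```

In this example the third hospital, whose discrepancy is large relative to the variance it could remove, receives a weight of exactly zero, while the two similar hospitals share most of the remaining weight with the target.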

The GLOBAL- estimator produces weights in between the GLOBAL- weights and the SS weights. This follows the expected relationship outlined in Section 3, as the method for obtaining GLOBAL- emphasizes retaining all hospitals in the analysis, albeit while still placing additional weight on source hospitals that are more similar to the target hospital. An example of covariate imbalance across the three estimators is given in Appendix IV.

Figure 4: The global estimators place more weight on source hospitals with more similar TATE estimates, whether the target hospital is an urban major teaching hospital (A), an urban for-profit hospital (B), or a rural non-teaching hospital (C).

As noted earlier, estimators that only use the target hospital’s own data often lack power to distinguish treatment effects, potentially leading hospitals to misinterpret their performance. Thus, the appeal of the federated causal inference framework is that it helps the target hospital estimate its treatment effects more precisely. To demonstrate the efficiency gain from using a global estimator to estimate the TATE, we plotted the TATE estimate for each hospital using the target-only estimator (left panel), the GLOBAL- estimator (middle panel), and an overlay of the target-only, GLOBAL-, GLOBAL-, and SS estimators (right panel) (Figure 5). In the figure, each hospital takes its turn as the target hospital, with the other 50 hospitals serving as potential source hospitals. As can be seen, the GLOBAL- estimator yields substantial variance reduction for each target hospital (compared to the target-only estimator) while introducing minimal discrepancy (compared to the GLOBAL- and SS estimators). Due to this efficiency gain, the qualitative conclusion regarding the mortality effect of PCI treatment relative to MM changes for 22 of the 51 hospitals. In other words, the choice of estimator can have important implications for performance evaluation and strategic decision-making.

Figure 5: TATE estimates for all hospitals show substantial precision gain of GLOBAL- compared to the Target-Only estimator and better accuracy relative to the GLOBAL- and SS estimators.

The precision gain for each hospital using GLOBAL- compared to the Target-Only estimator is substantial, with a median reduction in the TATE SE, ranging from to (Figure 6). Moreover, this precision gain was not accompanied by a noticeable loss of accuracy. As can be seen in Figure 5, GLOBAL- has a smaller discrepancy to the Target-Only estimates relative to the GLOBAL- and SS estimators. Taken together, these findings show that the GLOBAL estimator can provide hospitals with more precise yet still accurate guidance on their performance. Note also that while the GLOBAL- estimates are not as accurate as the GLOBAL- estimates due to the sparse setting, they still demonstrate substantial precision gains over the Target-Only estimator. Therefore, the GLOBAL- estimator should be a useful approach in denser settings.

Figure 6: The proportion of each vertical line that is in color represents the percent reduction in SE using the GLOBAL- estimator for each hospital’s TATE estimate. Red signifies a change in interpretation from no treatment effect to a significant treatment effect.
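The two quantities displayed in Figure 6 can be sketched as follows: the percent reduction in standard error when moving from the target-only estimate to the global estimate, and a flag for whether the interpretation changes from no detectable effect to a significant one. The function names, the Wald-type 95% interval used to define significance, and the numbers in the usage lines are assumptions for illustration.

```python
import statistics


def percent_se_reduction(se_target_only, se_global):
    """Percent reduction in standard error relative to the target-only SE."""
    return 100.0 * (1.0 - se_global / se_target_only)


def excludes_zero(estimate, se, z=1.96):
    """True when a Wald-type interval (an assumption) does not contain zero."""
    return estimate - z * se > 0.0 or estimate + z * se < 0.0


def interpretation_changes(est_target, se_target, est_global, se_global):
    """True when only the global interval excludes zero."""
    return (not excludes_zero(est_target, se_target)) and excludes_zero(
        est_global, se_global
    )


# Hypothetical hospitals: (target-only estimate, SE, global estimate, SE).
hospitals = [
    (-0.05, 0.04, -0.05, 0.02),
    (-0.02, 0.05, -0.01, 0.03),
    (-0.10, 0.03, -0.09, 0.012),
]
reductions = [percent_se_reduction(h[1], h[3]) for h in hospitals]
median_reduction = statistics.median(reductions)
changed = [interpretation_changes(*h) for h in hospitals]
```

In the hypothetical data, only the first hospital changes interpretation: its target-only interval covers zero while its narrower global interval does not.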

In addition to ranking hospitals based on their TATE performance, we also used the GLOBAL- estimator to rank hospitals on their performance had all patients received PCI , or had all patients received MM . In so doing, we provide comprehensive guidance on specific AMI treatments, information that can be useful both to hospitals and prospective patients. Figure 7 shows treatment-specific hospital mortality estimates for PCI and MM, with hospitals sequenced from the lowest (best) to highest (worst) PCI mortality. The figure shows that PCI patients had lower mortality than MM patients in nearly all hospitals, but at the low end of PCI performance, hospitals' PCI mortality rates could not be distinguished from their MM mortality rates. Indeed, the MM mortality estimate was lower than the PCI mortality estimate in the hospital with the lowest PCI ranking, although again, the CIs overlapped. Note that no hospital fared exceptionally well on both PCI and MM ranking. In fact, although the hospitals that ranked highly on PCI mortality had similar PCI mortality estimates to each other, their MM mortality estimates were highly variable. These findings suggest that the staffing, skill, and resource inputs that translate to better performance in interventional cardiology differ from those that drive excellent MM practices.

Figure 7: The treatment-specific GLOBAL- hospital mortality estimates and CIs show that hospitals that ranked highly on PCI mortality tended not to rank as well on MM and vice versa; moreover, MM mortality estimates varied considerably even among hospitals with similar PCI performance.
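One simple way to quantify how closely treatment-specific rankings agree is a rank correlation between hospitals' PCI and MM mortality estimates. The sketch below is an illustrative summary that is not reported in the paper; the function names, the use of Spearman's coefficient, and the mortality values are assumptions, and ties are not handled.

```python
def ranks(values):
    """Rank values from lowest (rank 1) to highest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x, y):
    """Spearman rank correlation for two equal-length lists without ties."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    squared_gaps = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1.0 - 6.0 * squared_gaps / (n * (n ** 2 - 1))


# Hypothetical mortality estimates for four hospitals.
pci_mortality = [0.05, 0.06, 0.08, 0.10]
mm_mortality = [0.20, 0.12, 0.18, 0.15]
rho = spearman(pci_mortality, mm_mortality)
```

A coefficient near one would indicate that hospitals strong in PCI are also strong in MM, whereas a value near zero or below, as in this hypothetical example, would be consistent with treatment-specific strengths.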

6 Discussion

We developed a federated and adaptive method that leverages summary data from multiple hospitals to safely and efficiently estimate target hospital treatment effects for hospital quality measurement. Our global estimation procedure preserves patient privacy and requires only one round of communication between target and source hospitals. We used our federated causal inference framework to investigate quality among candidate CCE in the U.S. We obtained accurate TATE estimates accompanied by substantial precision gains, ranging between 34% and 86% relative to the estimator using only target hospital data.

Our global estimator can be used in other federated data settings, including transportability studies in which some hospitals have access to randomized trial data and other hospitals have only observational data. In this setting, one could anchor the estimates on a hospital with randomized trial data and enhance the data with observational data from other hospitals. Alternatively, if one were primarily interested in transporting causal estimates to an observational study, one could treat the observational study as the target study and use the randomized trial as a source study within our framework.

Importantly, our method enables a quality measurement framework that benchmarks hospitals on their performance on specific alternative treatments for a given diagnosis as opposed to only aggregate performance or an isolated treatment. This comprehensive analysis enabled us to show that among patients admitted for AMI, superiority in one treatment domain does not imply success in another treatment. Equipped with these methods and results, hospitals can make informed strategic decisions on whether to invest in shoring up performance on treatment-specific outcomes where they are less successful than their peers, or to allocate even more space and resources to treatment paradigms where they have a comparative advantage. Our treatment-specific assessments reflect the nuance that different hospitals sometimes have comparative advantages and disadvantages, and in so doing, these separate rankings can inform hospitals on which forms of treatment may require quality improvement efforts.

Moreover, this information may also be useful to patients. Specifically, if a patient has a strong preference for one treatment over another, they can opt for a hospital that excels at their preferred course of treatment. This feature of the framework may be even more useful in elective clinical domains where patient agency over facility choice is not diminished by the need for emergent treatment, such as oncology, where treatment options vary substantially and are carefully and collaboratively considered by physicians and their patients. For example, many solid-tumor cancers are treated with a typically elective surgery followed by adjuvant therapy in the form of either chemotherapy, radiation, or a combination of both. However, some patients may exhibit a strong and justifiable aversion to a particular therapy due to concerns about side effects. For example, if a patient requires cancer resection followed by adjuvant therapy but does not want chemotherapy, they can instead opt for treatment at a cancer center that showcases strong performance when rendering surgery with adjuvant radiation.

The limitations of our approach present opportunities for future advancements. In the multi-hospital setting, even if a common set of covariates is universally known and acknowledged, it may be difficult to agree upon a single functional form for the models. Therefore, researchers in different hospitals might propose different candidate models. For example, treatment guidelines can differ across CCE based on variation in patient populations or the resources at each hospital’s disposal. For increased robustness, han2013estimation developed multiply robust estimators, wherein an estimator is consistent as long as one of a set of postulated propensity score models or outcome regression models is correctly specified. Multiply robust methods may be particularly appealing in federated data settings, where researchers at different hospitals are likely to specify different propensity score and outcome regression models.


The online supplementary materials include derivations of the influence functions, proofs, and additional results from the simulation study and real-data analysis. We also provide code to replicate all results in the paper.