1 Introduction
Multicenter, federated causal inference is of great interest, particularly when studying novel treatments, rare diseases, or in times of urgent health crises. For example, the COVID-19 pandemic has highlighted the need for efficient and safe approaches to evaluating the efficacy of novel therapies and vaccines, leveraging data from multiple healthcare systems to ensure the generalizability of findings. Over the past few years, many research networks and data consortia have been built to facilitate multi-site studies and have been actively contributing to COVID-19 studies, including the Observational Health Data Sciences and Informatics (OHDSI) consortium (hripcsak2016characterizing) and the Consortium for Clinical Characterization of COVID-19 by EHR (brat2020international). Analyzing data collected from multiple healthcare systems, however, is highly challenging for several reasons. First, various sources of heterogeneity exist in terms of differences in the underlying population of each data set and policy-level variations in treatment assignment. Since the treatment effect may differ across patient populations, it is of interest to infer the causal treatment effect for a specific target population. However, the presence of heterogeneity and potential model misspecification makes it difficult to ensure valid estimates of the target average treatment effect (ATE). Furthermore, patient-level data typically cannot be shared across healthcare centers, posing an additional challenge. To overcome these challenges, we propose a federated causal inference framework to estimate the ATE for the target population, leveraging heterogeneous data from other source populations while accounting for data sharing constraints through a privacy-preserving and communication-efficient federated learning strategy.
Most existing literature on federated learning has focused on regression and classification models (chen2006regression; li2013statistical; chen2014split; lee2017communication; lian2017divide; wang2019distributed; duan2020learninga). Limited federated learning methods currently exist for causal inference with multiple heterogeneous studies. Recently, xiong2021federated
proposed federated inverse probability weighted (IPW) estimation of the ATE, where the estimand is the ATE for the entire population rather than a target population. The ATE considered in
xiong2021federated is a special case of our setting when the target population is set to be the entire population. Although xiong2021federated provided multiple methods for point estimation and variance estimation, the choice of the proper method depends on prior knowledge about model homogeneity and model specification, which is usually unavailable in practice. In addition, no empirical study in
xiong2021federated was provided to test the robustness of the approach to the covariate shift assumption. vo2021federated proposed a Bayesian approach that models potential outcomes as random functions distributed by Gaussian processes. Their focus is also on the population ATE rather than a particular target population, and their approach requires specifying parameters and hyperparameters of Gaussian processes and modeling correlations through kernel functions, which may be numerically intensive. Compared to these approaches, our approach estimates the ATE in a particular target population and accounts for potential heterogeneity in the covariate distributions across populations. Most importantly, our approach additionally safeguards against the case where incorporating source data lowers the performance of the method compared to using target data alone, known as negative transfer
(pan2009survey). Another related strand of literature concerns the generalizability and transportability of randomized clinical trials to EHR studies. For example, stuart2011use; stuart2015assessing; stuart2018generalizability assessed the generalizability of results from randomized trials to target populations of interest. dahabreh2020extending; dong2020integrative, and josey2020calibration all focused on extending inferences about treatments from a randomized trial to a new target population by using different weighting schemes. For a comprehensive review of statistical methods for generalizability and transportability, see degtiar2021review. However, to date, no literature in generalizability and transportability has sought to leverage observational data from a potentially large number of source sites in a data-adaptive manner to obtain unbiased, efficient, and robust estimation of target treatment effects.
To incorporate heterogeneous data from multiple sites to make inference about the target average treatment effect (TATE), we develop a federated adaptive causal estimation (FACE) framework. Our major contributions can be summarized as follows. First, our framework allows for flexibility in the specification of the target population (e.g., a single site, some sites, or all sites). Second, our method accounts for site-level heterogeneity in the distribution of covariates through a density ratio weighting approach, such that each source site can obtain a doubly robust estimate of the TATE. Third, to safely aggregate estimates from all sites and avoid negative transfer, we introduce an adaptive method that anchors on the target population estimate and then estimates data-adaptive weights to combine source site estimates. Compared to the estimator using only target population data, our proposed global estimator adaptively combines data from all populations to meaningfully increase estimation precision without introducing bias. In addition, we propose a communication-efficient and privacy-preserving federated algorithm to obtain the doubly robust estimator, which only requires each participating site to share information once.
The remainder of the paper is organized as follows. In Section 2, we introduce the problem setting, notation, and assumptions required for identification of the TATE. In Section 3, we present the proposed FACE framework for estimating the TATE. We introduce the within-site estimators based on the target population and source populations separately in Sections 3.1 and 3.2 and present the adaptive and distributed integration in Section 3.3. In Section 4, we provide the theoretical guarantees of the method, including double robustness and asymptotic normality. In Section 5, we conduct extensive simulations across various numbers of sites and sample sizes and show robustness to misspecification of different models. In Section 6, we apply FACE to conduct a comparative effectiveness study of COVID-19 vaccines using the EHRs from five federated VA sites. We conclude in Section 7 with key takeaways and directions for future research.
2 Setting and Notation
For the i-th observation, we denote the outcome as Y_i, the p-dimensional baseline covariate vector as X_i, and the indicator for binary treatment as A_i. Under the federated learning setting, a total of N observations are stored at K study sites, where the k-th site has sample size n_k and N = \sum_{k=1}^{K} n_k. Let R_i be a site indicator such that R_i = k indicates that the i-th patient belongs to the k-th site. Indexing the site by a single integer R_i, we assume that each observation may only belong to one site. We summarize the observed data at each site as D_k = \{(Y_i, X_i, A_i) : R_i = k\}, where each site has access to its own patient-level data but cannot share this data with other sites. We denote the index set for each site as I_k = \{i : R_i = k\} and the data in the target sites as D_T. Let T be a subset of \{1, \ldots, K\} indicating sites that are in the target population, with its complement T^c containing sites that are in the source population. We denote the total sample size of the target sites as n_T = \sum_{k \in T} n_k. For simplicity of notation, we use (Y, X, A, R) without subscripts to state general assumptions and conclusions. Under the potential outcomes framework (neyman1923application; rubin1974estimating), we denote Y_i^{(a)} as the potential outcome of patient i under treatment A_i = a, a \in \{0, 1\}. Our goal is to estimate the target average treatment effect (TATE) for a specified target population,
\Delta_T = E\{Y^{(1)} - Y^{(0)} \mid R \in T\},    (1)
where the expectation is taken over the covariate distribution in the target population. The target population can be specified at multiple levels corresponding to different targets of real-world interest. When the target of interest includes patients across all sites, the TATE is simply the ATE. This corresponds to the special case studied by xiong2021federated. When the target population includes only some of the sites, leveraging source sites is useful as it can allow for greater efficiency in the estimation of the TATE.
To identify the TATE, we make the following standard assumptions throughout the paper. For an absolute positive constant c, treatment a \in \{0, 1\}, and site k \in \{1, \ldots, K\}:


Consistency: Y_i = Y_i^{(A_i)}.

Overlapping of treatment arms: P(A = a \mid X, R = k) \geq c, for a \in \{0, 1\} and k = 1, \ldots, K.

Overlapping of site populations: P(R = k \mid X) \geq c, for k = 1, \ldots, K.

Ignorability: Y^{(a)} \perp A \mid (X, R = k), for a \in \{0, 1\} and k \in T.
Assumption 1 states that the observed outcome for a patient under treatment a is the patient's potential outcome under the same treatment. Assumptions 2 and 3 are the standard overlap assumptions (rosenbaum1983central), which state that the probability of being assigned to each treatment/site, conditional on baseline covariates, is positive in each site. Here we allow each site to have a site-specific propensity score. Assumption 4 can be derived from the typical ignorability condition (imbens2015causal; hernan2020causal). For further discussion on assumptions for identification, see dahabreh2020extending.
Note that Assumption 4 is only made for the target sites. We impose no such stringent requirement on the source sites. In the worst-case scenario, data from a source site can be totally irrelevant to the target population, such that the estimator from that site is severely biased. To prevent negative transfer in such cases, we devise an adaptive selection and aggregation procedure for the source sites. We denote the models for the site-specific propensity scores and outcome regressions:
\pi_k(x) = P(A = 1 \mid X = x, R = k), \qquad m_{a,k}(x) = E(Y \mid X = x, A = a, R = k).    (2)
Since the distribution of the covariates may be heterogeneous across sites, we characterize the difference in covariate distributions between the target site(s) and the k-th source site through the density ratio \omega_k(x).
We choose flexible semiparametric models for the density ratio
\omega_k(x) = \exp\{\gamma_k^T \psi(x)\},    (3)
where \psi(x) is a basis function with an intercept term.
The model (3) is known as the exponential tilt density ratio model, which has wide application in areas such as mixture models to account for heterogeneity between two distributions (qin1998inferences; qin11; HongJASA2017; duan20201fast). For example, qin11 studied the exponential tilt model with \psi(x) = (1, x^T)^T, which recovers the whole class of natural exponential family distributions, including the normal distribution with mean shift, the Bernoulli distribution for binary covariates, etc. By including higher-order terms of x in \psi(x), higher-order differences such as variance, skewness, and covariance can be captured. An important contribution of our paper is a privacy-preserving and communication-efficient approach to estimating the heterogeneity in covariate distributions, which we describe in more detail in Section 3.
3 Method
Starting with an overview, we illustrate the workflow of the FACE framework in Figure 1. In Step 1, an estimate is constructed locally within each target site, and these estimate(s) are shared with a leading site. In the meantime, each target site also calculates summary statistics and broadcasts them to the source sites, which will later be used for estimation of the density ratio model. In Step 2, each source site constructs a site-specific estimate of the TATE and passes it to the leading site. In Step 3, the leading site processes the aggregation with estimators and parameters from Steps 1 and 2 to obtain the final FACE estimator. Overall, each site is only required to share information one time with other sites. We provide detailed explanations for each step of FACE, including the summary-level information that must be shared across sites, in Sections 3.1–3.3
with generic models. An example with logistic regression models is given in Section 3.4.
3.1 Step 1: Estimation Using Target Data
The initial doubly robust TATE estimator is obtained from the site-specific ATE of the target site(s). For each target site, we estimate the within-site propensity score and outcome regression models using classical regression models and methods such as maximum likelihood estimation or estimating equations.
We compute the within-site contribution to the doubly robust TATE from its plug-in component and augmentation component, as well as summary characteristics of the covariate distribution from the target sites:
(4) 
Once these summary statistics are shared for every target site, each site can compute the doubly robust TATE estimate, its augmentation component, and the mean covariate vector from the target sites:
When the target population contains multiple sites, we also share the estimators for the variance-covariance of these quantities. Variance estimation can be conducted through classical influence functions or the bootstrap within each site. The role of this matrix in the aggregation will be explained in (10).
The initial TATE estimator is the doubly robust estimator with site-stratified propensity score and outcome regression:
By bang2005doubly, the estimator is consistent if either the propensity score or the outcome regression is consistently estimated within each target site. Mixing and matching is allowed, i.e., the propensity score model may be correct for some target sites while the outcome regression is correct for the rest. If a common outcome regression model were assumed, then it could be estimated jointly for an additional efficiency gain.
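As a concrete illustration, the within-site doubly robust construction can be sketched with a numpy-only simulation. The working models here (a logistic propensity score fit by Newton-Raphson and linear outcome regressions fit by least squares) and all function names are our own illustrative choices, not the paper's exact specification.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson (IRLS); X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1 - p) + 1e-8                       # IRLS weights, kept away from zero
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

def aipw_ate(X, A, Y):
    """Doubly robust (AIPW) within-site ATE: plug-in component plus augmentation."""
    Z = np.column_stack([np.ones(len(Y)), X])        # design matrix with intercept
    ps = 1.0 / (1.0 + np.exp(-Z @ fit_logistic(Z, A)))            # propensity scores
    b1, *_ = np.linalg.lstsq(Z[A == 1], Y[A == 1], rcond=None)    # outcome model, treated
    b0, *_ = np.linalg.lstsq(Z[A == 0], Y[A == 0], rcond=None)    # outcome model, control
    m1, m0 = Z @ b1, Z @ b0
    plug_in = np.mean(m1 - m0)                       # outcome-regression component
    aug = np.mean(A * (Y - m1) / ps - (1 - A) * (Y - m0) / (1 - ps))  # weighted residuals
    return plug_in + aug

# Simulated single-site data with a known treatment effect of 2.
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * X[:, 0] - 0.5 * X[:, 1]))))
Y = 2.0 * A + X @ np.array([1.0, -1.0]) + rng.normal(size=n)
est = aipw_ate(X, A, Y)
```

Because both working models are correct in this simulation, `est` recovers the true effect; the doubly robust property means one of the two models could be misspecified and consistency would still hold.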
3.2 Step 2: Estimation Using Source Data
To assist in estimating the TATE using source data, the covariate shifts between the source sites and the target population need to be adjusted for in order to eliminate potential biases. We achieve this goal by tilting the source population to the target population through the density ratio model introduced in (3). When all data are pooled together, estimation can be achieved by constructing a pseudo-likelihood function as described in qin1998inferences. In the federated setting, we can only use the summary statistics obtained from the target sites and estimate the density ratio by solving the estimating equation
(5) 
The above estimation is done completely within the k-th source site. Together with the within-site propensity score and outcome regression models, we construct the site-specific augmentation,
(6)  
(7) 
Then, the site-specific augmentations are shared back to the leading site. The source sites also share an estimate of the conditional variance and an estimate of the partial derivative of the augmentation with respect to the shared target summaries, whose roles in the aggregation will be explained in (10). Both can be constructed from classical influence functions, which we provide in the Supplement. Alternatively, the variance may be estimated by bootstrapping within site and the derivative by numerical differentiation.
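The federated density ratio estimation admits a compact sketch: a source site solves a moment-matching equation so that its exponentially tilted covariate moments equal the summary vector shared by the target site(s). The Newton solver and the simulated summaries below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def estimate_tilt(psi_src, psi_bar_target, n_iter=50):
    """Solve (1/n_k) sum_i exp(gamma' psi_i) psi_i = psi_bar_target for gamma
    by Newton's method; psi_src must include an intercept column."""
    gamma = np.zeros(psi_src.shape[1])
    for _ in range(n_iter):
        w = np.exp(psi_src @ gamma)                       # tilt weights exp(gamma' psi)
        g = psi_src.T @ w / len(w) - psi_bar_target       # moment-matching residual
        J = psi_src.T @ (w[:, None] * psi_src) / len(w)   # Jacobian of the moment map
        gamma -= np.linalg.solve(J, g)
    return gamma

# Source covariates are standard normal; the target summary says E[X] = 0.5.
rng = np.random.default_rng(1)
x_src = rng.normal(0.0, 1.0, size=5000)
psi_src = np.column_stack([np.ones_like(x_src), x_src])   # psi(x) = (1, x)
psi_bar_target = np.array([1.0, 0.5])                     # shared target-site summary
gamma_hat = estimate_tilt(psi_src, psi_bar_target)
w = np.exp(psi_src @ gamma_hat)
tilted_mean = np.sum(w * x_src) / len(x_src)              # matches 0.5 by construction
```

Only the low-dimensional summary vector leaves the target sites: the intercept component of the equation forces the weights to average to one, while the remaining components align the tilted source moments with the target.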
Combining the site-specific augmentation with the plug-in component of the initial TATE estimator, we obtain the site-specific doubly robust estimator
The estimator is doubly robust in the following sense: either (1) the outcome regression is consistent for the target site(s) and source sites, or (2) the propensity score and density ratio are consistent for source sites while the outcome regression estimator is shared across all sites.
3.3 Step 3: Adaptive Aggregation
In the final step, we obtain our FACE estimator by adaptively aggregating the initial TATE estimator and the site-specific estimators. The leading site can estimate the TATE as
(8) 
with the weights estimated in a data-adaptive fashion to filter out potentially biased site-specific estimators. The second expression in (8) shows how the shared parameters from Steps 1 and 2 are used in the construction.
The FACE estimator is a linear combination of the target-only and site-specific estimators. It can also be interpreted as an “anchor and augmentation” estimator. We may alternatively express (8) as
The alternative expression makes it clear that we “anchor” on the estimator from the target site(s) and augment the anchor estimator with a weighted difference between the target and source site estimates. In general, since source sites may exhibit empirical bias in estimating the TATE, the weights should be estimated in a data-adaptive fashion to safeguard against biased estimates, i.e., to downweight or exclude likely biased source site estimates of the TATE.
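Writing \widehat{\Delta}_T for the target-only estimator, \widehat{\Delta}_k for the site-specific estimators, and \eta_k for the adaptive weights (generic symbols of our own choosing), the linear-combination and anchor-and-augmentation readings coincide algebraically:

```latex
\widehat{\Delta}(\eta)
  = \Big(1 - \sum_{k \in \mathcal{T}^c} \eta_k\Big)\,\widehat{\Delta}_{\mathcal{T}}
    + \sum_{k \in \mathcal{T}^c} \eta_k\, \widehat{\Delta}_k
  = \widehat{\Delta}_{\mathcal{T}}
    + \sum_{k \in \mathcal{T}^c} \eta_k \big(\widehat{\Delta}_k - \widehat{\Delta}_{\mathcal{T}}\big).
```

Setting every \eta_k = 0 recovers the target-only estimator, which is why the adaptive weighting can protect against negative transfer: source sites receive weight only when doing so helps.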
Moreover, the aggregation of the remaining unbiased site-specific augmentations should also minimize variance. Under the federated learning setting, the key to evaluating the variance of (8) is to decompose it into contributions from separate sites so that they can be estimated and shared within each site. We consider the following decomposition
(9) 
where the limit term is the partial derivative of the augmentation with respect to the target summaries. We decouple the dependence of the source site augmentation on the target site(s) by subtracting the first-order approximation of the dependence. The resulting term is asymptotically independent of the target site(s).
Motivated by (9), we propose the adaptive penalized optimal aggregation
(10) 
with the components from Steps 1 and 2. The multiplicative factor is required to stabilize the loss, as the variance components decay to zero at the parametric rate. With a suitable choice of the tuning parameter, we achieve an oracle property for the selection and aggregation: (1) the biased site-specific augmentations receive zero weights with large probability; (2) the regularization on the weights for unbiased site-specific augmentations is asymptotically negligible.
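A simplified numerical sketch of the penalized aggregation: assuming, for illustration only, independent site-level estimates with known variances, the weights minimize a variance proxy plus the adaptive penalty, so a source site whose estimate disagrees with the target anchor is driven to zero weight. The projected-gradient solver and all inputs below are our own stand-ins for the influence-function-based quantities in (10).

```python
import numpy as np

def face_aggregate(d_t, v_t, d_src, v_src, lam, n_iter=5000):
    """Minimize (1 - sum eta)^2 * v_t + sum eta_k^2 * v_k + lam * sum eta_k * |d_k - d_t|
    over eta_k in [0, 1] by projected gradient descent (a convex problem)."""
    d_src, v_src = np.asarray(d_src, float), np.asarray(v_src, float)
    delta = np.abs(d_src - d_t)                  # empirical bias of each source site
    eta = np.zeros(len(d_src))
    step = 1.0 / (2 * v_t * len(d_src) + 2 * v_src.max())   # safe fixed step size
    for _ in range(n_iter):
        grad = -2 * (1 - eta.sum()) * v_t + 2 * eta * v_src + lam * delta
        eta = np.clip(eta - step * grad, 0.0, 1.0)          # project onto [0, 1]
    est = (1 - eta.sum()) * d_t + eta @ d_src               # anchor-and-augment form
    return est, eta

# Target estimate plus one unbiased and one badly biased source site.
d_t, v_t = 2.05, 0.04
est, eta = face_aggregate(d_t, v_t, d_src=[2.00, 3.50], v_src=[0.01, 0.01], lam=0.2)
```

In this toy configuration the biased site is excluded entirely while the unbiased, more precise site is upweighted, shrinking the variance of the combined estimate relative to the target-only anchor.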
Using the variance estimator (stabilized likewise by the same multiplicative factor)
(11) 
and the quantile of the standard normal distribution, we construct the confidence interval
\big[\widehat{\Delta} - z_{1-\alpha/2}\sqrt{\widehat{V}},\ \widehat{\Delta} + z_{1-\alpha/2}\sqrt{\widehat{V}}\big].    (12)
The full FACE workflow is summarized in Algorithm 1.
The solution of (10) only involves a small number of summary statistics, growing proportionally to the dimension of the density ratio basis for each site. Therefore, our aggregation is communication-efficient and privacy-protected. Classically, procedures like (10) are constructed with influence functions. Given the estimated influence functions of the target estimator and the site-specific augmentations, the adaptive selection and aggregation weights can be obtained by solving the following penalized regression
We propose a novel sample-splitting approach for the optimal selection of the tuning parameter without sharing individual-level data. We split the data into training and validation datasets across all sites. In the training dataset, we estimate our nuisance parameters and influence functions and solve (10) distributively for a grid of tuning values. Using the associated weights from each value, we estimate the MSE in the validation data. We set the optimal tuning parameter to be the value that minimizes the MSE in the validation data.
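The tuning-parameter selection can be sketched as a grid search: fit aggregation weights on training-split summaries, then score each candidate value on the validation split with an estimated MSE (here a bias-squared-plus-variance proxy). The solver, the criterion, and all numbers are simplified assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def agg_weights(d_t, v_t, d_src, v_src, lam, n_iter=5000):
    """Penalized aggregation weights via projected gradient descent on [0, 1]."""
    d_src, v_src = np.asarray(d_src, float), np.asarray(v_src, float)
    delta = np.abs(d_src - d_t)
    eta = np.zeros(len(d_src))
    step = 1.0 / (2 * v_t * len(d_src) + 2 * v_src.max())
    for _ in range(n_iter):
        grad = -2 * (1 - eta.sum()) * v_t + 2 * eta * v_src + lam * delta
        eta = np.clip(eta - step * grad, 0.0, 1.0)
    return eta

def select_lambda(train, valid, lam_grid):
    """Pick the lambda whose training-split weights minimize an estimated MSE on the
    validation split: squared deviation from the unbiased validation target anchor
    plus the variance of the weighted combination."""
    best = None
    for lam in lam_grid:
        eta = agg_weights(train["d_t"], train["v_t"], train["d_src"], train["v_src"], lam)
        d_src_v = np.asarray(valid["d_src"], float)
        v_src_v = np.asarray(valid["v_src"], float)
        est_v = (1 - eta.sum()) * valid["d_t"] + eta @ d_src_v
        var_v = (1 - eta.sum()) ** 2 * valid["v_t"] + (eta ** 2 * v_src_v).sum()
        mse = (est_v - valid["d_t"]) ** 2 + var_v
        if best is None or mse < best[0]:
            best = (mse, lam, eta)
    return best[1], best[2]

# Split summaries: source site 2 is biased in both halves.
train = {"d_t": 2.05, "v_t": 0.04, "d_src": [2.00, 3.50], "v_src": [0.01, 0.01]}
valid = {"d_t": 1.95, "v_t": 0.04, "d_src": [2.02, 3.45], "v_src": [0.01, 0.01]}
lam_best, eta_best = select_lambda(train, valid, lam_grid=[0.0, 0.05, 0.2, 1.0])
```

In this stylized version only site-level summaries (point estimates and variances) cross site boundaries, consistent with the one-round communication pattern; the selected penalty is strong enough to exclude the biased site.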
3.4 FACE Under Logistic Regression Models
We showcase FACE with the following realization under logistic regression models. Let the outcome be binary, and, for notational ease, let the covariate vector include an intercept term. We fit logistic regression models with the logit link for all propensity score and outcome regression models. The basis function in the density ratio model is set as the identity map, which corresponds to a multinomial logistic regression model on the site indicator.
In Step 1, we first calculate the mean covariate vector in the target site and transfer it to sites 2 through 5. Then, we estimate the models for
Using the estimated models, we obtain the initial estimator and its augmentation component. The variance-covariance matrix estimator can be calculated through the estimated influence functions, whose exact forms are given in the Supplement.
In Step 2, we estimate the models for
Using the estimated models, we obtain the site-specific augmentations
The partial derivative of the augmentation with respect to the target summary is estimated by
The variance estimator can be calculated through the estimated influence function, whose form is given in the Supplement.
In Step 3, we use the shared estimates from Steps 1 and 2 to solve the adaptive selection and aggregation problem (10), which leads to the FACE estimator and its confidence interval.
4 Theoretical Guarantees
In this section, we provide the theoretical results for our target-only estimator, source site estimators, and adaptive combination estimator. We start with high-level theory for a generic choice of models in Section 4.1 and give in Section 4.2 a detailed set of conditions corresponding to the realization of Section 3.4. In our asymptotic theory, the total sample size may grow but the distribution for a single copy of the data, including the number of sites, stays the same.
4.1 Theory for General FACE
For the flexible FACE framework, we introduce a set of high-level regularity conditions, stated in terms of absolute positive constants.


(Regularity of TATE estimators) The estimators admit approximations by influence functions, with well-defined asymptotic limits and mean-zero random variables.
(Stable variance) The variances of the influence functions are bounded away from zero and infinity. The covariates in the density ratio are bounded almost surely and have stable variance. The variance-covariance matrix has eigenvalues bounded away from zero and infinity.
(Regularity of auxiliary estimators) The auxiliary estimators are consistent.

(Double robustness) For each target site, at least one of the two models is correctly specified:


the propensity score model is correctly specified.

the outcome regression model is correctly specified.

The first three conditions are the typical regularity conditions under classical parametric models. They can be verified in two steps: 1) asymptotic normality of the model estimators
(van2000asymptotic) and 2) local expansion of the estimators. Assumption 2 regulates the scale of variability of the data, which leads to a stable variance for the FACE estimator. Assumption 4 ensures identification of the true TATE by anchoring on the target-site doubly robust estimator (bang2005doubly). Note that in the setting of multiple target sites, Assumption 4 allows each target site to have a different correct model specification, either the outcome regression (OR) model or the propensity score (PS) model. We now state the theory for the general FACE estimation. Under Assumptions 2 and 4.1, the FACE estimator is consistent and asymptotically normal with consistent variance estimation.
Based on Theorem 4.1, we justify that the interval estimator (12) provides asymptotically honest coverage: under Assumptions 2 and 4.1, the coverage rate of the confidence interval (12) approaches the nominal level asymptotically.
A key step in the proof of Theorem 4.1 is the analysis of the penalized adaptive selection and aggregation (10). The problem differs from the typical penalized regression, so we develop a new proof strategy. We first analyze the optimal combination with oracle selection, in which the biased augmentations are excluded. For unbiased augmentations, the penalty term is asymptotically negligible. Thus, the estimated combination converges to its asymptotic limit at the regular rate. Finally, we show that the estimated combination with oracle selection also solves the original problem with large probability.
For consistency of the FACE estimator, we require that the propensity score or outcome regression model is correct for the target site(s) so that the initial doubly robust TATE estimator is consistent. Even if the models for the source sites and density ratios are misspecified, we can safely aggregate the site-specific augmentations via the adaptive selection in (10). To meaningfully leverage information from source sites for the TATE, we would expect that many source sites

satisfy the ignorability condition 4;

share the same outcome regression model with the target sites; and

either the outcome regression model is correct, or both the propensity score and the density ratio models are correct.
For source sites satisfying the conditions above, the site-specific augmentations are unbiased and thus contribute to the efficiency improvement of the FACE estimator.
4.2 Application for the Realization
For the realization in Section 3.4, we denote the asymptotic parameters as