The growth in the availability and variety of clinical data has driven the trend toward multicenter research (Sidransky et al., 2009). Multicenter research confers many distinct advantages over single-center studies, including the ability to study rare exposures/outcomes that require larger sample sizes, accelerating the discovery of more generalizable findings, and bringing together investigators who share and leverage resources, expertise, and ideas (Cheng et al., 2017). Since individual-level information is often protected by privacy regulations and rules, directly pooling data across multiple clinical sites is often infeasible or requires a large amount of operational effort (Barrows Jr and Clayton, 1996). As a consequence, healthcare systems need more effective tools for evidence synthesis across clinical sites.
Distributed algorithms, also known as divide-and-conquer procedures, have been applied to multicenter studies. In the classical divide-and-conquer framework, the entire data set is split into multiple subsets and the final estimator is obtained by averaging the local estimators computed from the data in each subset (Li et al., 2013; Chen and Xie, 2014; Lee et al., 2017; Tian and Gu, 2016; Zhao et al., 2016; Lian and Fan, 2017; Battey et al., 2018; Wang et al., 2019). This class of methods adopts the same principle as meta-analysis in the area of evidence synthesis and systematic review, where the local estimates are combined through a fixed effect or random effects model (DerSimonian and Laird, 1986). When the number of research sites is relatively small, these averaging-type methods perform as well as the combined analysis using data from all the sites (Hedges, 1983; Olkin and Sampson, 1998; Battey et al., 2018). When the number of research sites is large, as we will demonstrate in the simulation studies, these averaging methods may not perform as well as the combined analysis. More importantly, when studying rare conditions, some clinical sites may not have enough cases for the local estimators to reach their asymptotic regime. In such cases, the averaging methods can be suboptimal.
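As a concrete illustration of the averaging principle discussed above, the following sketch contrasts the simple average, the fixed-effect (inverse-variance) meta-analysis combination, and a pooled analysis on simulated site-level data. The setting (a common mean measured with site-specific noise levels) and all variable names are illustrative assumptions, not part of this paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate K sites measuring a common mean theta = 2.0 with
# site-specific noise levels; each site shares only (estimate, variance).
theta, K, n = 2.0, 10, 50
sigmas = rng.uniform(0.5, 2.0, size=K)
site_data = [theta + s * rng.standard_normal(n) for s in sigmas]

# Local summary statistics: per-site estimate and its estimated variance.
est = np.array([x.mean() for x in site_data])
var = np.array([x.var(ddof=1) / n for x in site_data])

simple_avg = est.mean()                    # divide-and-conquer average
w = 1.0 / var                              # fixed-effect meta-analysis
meta = np.sum(w * est) / np.sum(w)         # inverse-variance weighting

pooled = np.concatenate(site_data).mean()  # unweighted pooled mean
print(simple_avg, meta, pooled)
```

All three combinations are consistent for the common mean here; they differ in efficiency and, crucially, in whether individual-level data must leave the sites (only the pooled analysis requires it).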
Recently, Wang et al. (2017) and Jordan et al. (2018) proposed a novel surrogate likelihood approach, which approximates the higher-order derivatives of the global likelihood by those of the likelihood function in a local site. This method has low communication cost and improves on the performance of the averaging method, especially when the number of sites is large; see Duan et al. (2019) for a real data application to pharmacoepidemiology. From a practical perspective, the surrogate likelihood approach provides a highly feasible framework for sharing sensitive data in collaborative environments, especially in the biomedical sciences, where the lead investigators often have access to the individual-level data in their home institution, and the collaborating investigators from other sites are willing to share summary statistics but not individual-level information.
Most of the aforementioned distributed algorithms assume that the data at different sites are independently and identically distributed. However, a prominent concern in multicenter analysis is that there may exist a non-negligible degree of heterogeneity across sites, because the samples collected at different sites may come from different sub-populations and environments. One concrete example is the Observational Health Data Sciences and Informatics consortium, which contains over 82 clinical databases from over 20 countries around the world (Hripcsak et al., 2015). Such heterogeneity cannot be ignored when implementing distributed algorithms in these healthcare networks.
In this paper, we aim to fill this methodological gap by extending distributed algorithms to account for heterogeneous distributions via site-specific nuisance parameters. In particular, we propose a density-ratio tilted surrogate efficient score approach which requires only the individual-level data from a local site and summary statistics from the other sites. To reduce the influence of estimating the site-specific nuisance parameters, we propose to base distributed inference on the efficient score function rather than on the likelihood as in Jordan et al. (2018). We further adjust for the degree of heterogeneity by applying a novel density ratio tilting method to the efficient score function. We refer to the resulting score function as the surrogate efficient score function. The estimator is defined as the root of this function. We show that the communication cost of the proposed algorithm is of the same order as that of Jordan et al. (2018), which assumes no heterogeneity, and the algorithm is therefore communication-efficient. We further establish the nonasymptotic risk bound of the estimator and its limiting distribution in the two-index asymptotic setting. Under mild conditions, the asymptotic variance of the proposed estimator attains the Cramér-Rao lower bound.
To the best of our knowledge, Zhao et al. (2016) is the only work in this area that considers a similar heterogeneous setting. They generalized the divide-and-conquer approach by averaging all the local estimators and studied its theoretical properties under the partially linear model. Unlike that work, we propose a surrogate efficient score approach under a general parametric likelihood framework. Theoretically, we show that our estimator approximates the global maximum likelihood estimator at a faster rate than the simple average estimator; see Remarks 3 and 4. From the inference perspective, our estimator attains the Cramér-Rao lower bound, whereas the simple average estimator has larger asymptotic variance and is not efficient; see Remark 6. We also show that the proposed estimator outperforms the simple average estimator in numerical studies.
2 Wang et al. (2017) and Jordan et al. (2018)’s surrogate likelihood approach for homogeneous distributions
Consider a general parametric likelihood framework, where the random variable $X$ follows the density function $f(x; \theta)$ indexed by a finite-dimensional unknown parameter $\theta$. In the distributed inference problem, we suppose there are $K$ different local sites. Denote $X_{ij}$ to be the $j$-th observation in the $i$-th site, for $i = 1, \dots, K$ and $j = 1, \dots, n$. For notational simplicity, we assume that each site has equal sample size $n$. The existing works on distributed inference such as Wang et al. (2017) and Jordan et al. (2018) further assume that all the observations are independently and identically distributed across sites, i.e., $X_{ij} \sim f(x; \theta)$ for all $i$ and $j$. Under this assumption, the combined log-likelihood function can be written as
$$\mathcal{L}(\theta) = \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}_i(\theta),$$
where $\mathcal{L}_i(\theta) = \frac{1}{n} \sum_{j=1}^{n} \log f(X_{ij}; \theta)$ is the log-likelihood function obtained at each site. Due to the communication constraint and privacy concerns, one cannot directly combine data across multiple sites to compute the maximum likelihood estimator. Motivated by the following Taylor expansion of the combined likelihood function around some initial value $\bar{\theta}$,
$$\mathcal{L}(\theta) = \sum_{t=0}^{\infty} \frac{1}{t!} \nabla^{t} \mathcal{L}(\bar{\theta}) \big[ (\theta - \bar{\theta})^{\otimes t} \big], \qquad (2.1)$$
Wang et al. (2017) and Jordan et al. (2018) proposed to construct a surrogate likelihood function by approximating all the higher-order derivatives in equation (2.1) using the individual-level data in one of the sites (such as the first site). When the data are identically and independently distributed across sites, it holds that $\mathbb{E}\{\nabla^{t} \mathcal{L}_1(\bar{\theta})\} = \mathbb{E}\{\nabla^{t} \mathcal{L}(\bar{\theta})\}$ for any $t \ge 2$, where $\mathcal{L}_1$ is the log-likelihood at the first site. Thus, $\nabla^{t} \mathcal{L}_1(\bar{\theta})$ is an asymptotically unbiased surrogate of $\nabla^{t} \mathcal{L}(\bar{\theta})$. Replacing $\nabla^{t} \mathcal{L}(\bar{\theta})$ with $\nabla^{t} \mathcal{L}_1(\bar{\theta})$ for all $t \ge 2$, the communication of the higher-order derivatives across sites can be avoided. Hence, by replacing $\nabla \mathcal{L}_1(\bar{\theta})$ with $\nabla \mathcal{L}(\bar{\theta})$, which also equals $\frac{1}{K} \sum_{i=1}^{K} \nabla \mathcal{L}_i(\bar{\theta})$, and dropping the terms independent of $\theta$, the surrogate likelihood is defined as
$$\widetilde{\mathcal{L}}(\theta) = \mathcal{L}_1(\theta) + \big\{ \nabla \mathcal{L}(\bar{\theta}) - \nabla \mathcal{L}_1(\bar{\theta}) \big\}^{\mathrm{T}} \theta. \qquad (2.2)$$
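A minimal sketch of the surrogate likelihood idea, under the simplifying assumption of a one-dimensional Gaussian model with known unit variance. In this toy case the log-likelihood is exactly quadratic, so the surrogate maximizer coincides with the global maximum likelihood estimator; the model choice is illustrative, not part of the papers cited above.

```python
import numpy as np

rng = np.random.default_rng(1)

# K sites, n observations each, X ~ N(theta*, 1); only gradients cross sites.
K, n, theta_star = 20, 100, 1.5
data = theta_star + rng.standard_normal((K, n))

def grad_loglik(x, theta):
    # d/dtheta of (1/n) * sum_j log N(x_j | theta, 1) = mean(x) - theta
    return x.mean() - theta

theta_bar = np.mean([x.mean() for x in data])   # initial: simple average

# Surrogate likelihood: keep Site 1's likelihood, correct its gradient at
# theta_bar to the exact global gradient (one round of communication):
#   L_tilde(theta) = L_1(theta) + (grad L(theta_bar) - grad L_1(theta_bar)) * theta
global_grad = np.mean([grad_loglik(x, theta_bar) for x in data])
local_grad = grad_loglik(data[0], theta_bar)

# Maximize L_tilde: grad L_1(theta) + (global_grad - local_grad) = 0
theta_tilde = data[0].mean() + global_grad - local_grad

theta_mle = data.mean()   # global MLE (pooled sample mean)
# For a quadratic log-likelihood the surrogate estimator equals the MLE exactly.
print(theta_tilde, theta_mle)
```

For non-quadratic likelihoods the two estimators differ by a higher-order term, which is what the theory in the cited papers controls.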
3 Surrogate efficient score method for heterogeneous distributions
We consider a heterogeneous setting by assuming the $j$-th observation in the $i$-th site satisfies
$$X_{ij} \sim f(x; \beta, \gamma_i), \qquad j = 1, \dots, n, \; i = 1, \dots, K,$$
where the unknown parameter $\theta_i$ can be decomposed into $\theta_i = (\beta^{\mathrm{T}}, \gamma_i^{\mathrm{T}})^{\mathrm{T}}$. In this partition, $\beta$ is a $p$-dimensional parameter of interest assumed to be common in every site, which is the main motivation of evidence synthesis, and the $q$-dimensional nuisance parameter $\gamma_i$ is allowed to be different across sites. The true value of $(\beta, \gamma_i)$ is denoted by $(\beta^*, \gamma_i^*)$.
If all patient-level data could be pooled together, the combined log-likelihood function is obtained by
$$\mathcal{L}(\beta, \Gamma) = \frac{1}{K} \sum_{i=1}^{K} \mathcal{L}_i(\beta, \gamma_i),$$
where $\mathcal{L}_i(\beta, \gamma_i) = \frac{1}{n} \sum_{j=1}^{n} \log f(X_{ij}; \beta, \gamma_i)$ and $\Gamma = (\gamma_1^{\mathrm{T}}, \dots, \gamma_K^{\mathrm{T}})^{\mathrm{T}}$. In a distributed setting, the surrogate likelihood approach reviewed in Section 2 is not directly applicable for the following two reasons: the higher-order derivatives of the log-likelihood function at any single site form a biased surrogate of the corresponding higher-order derivatives of $\mathcal{L}$, and the total number of nuisance parameters $Kq$ increases with the sample size if we allow $K$ to increase with $n$.
To extend the surrogate likelihood approach to incorporate the site-specific nuisance parameters, we propose to use the efficient score function as a way of reducing the influence of the less accurate estimation of the site-specific $\gamma_i$. More specifically, the efficient score function for $\beta$ is defined as
$$S_i(x; \beta, \gamma_i) = \nabla_{\beta} \log f(x; \beta, \gamma_i) - I^{(i)}_{\beta\gamma} \big( I^{(i)}_{\gamma\gamma} \big)^{-1} \nabla_{\gamma_i} \log f(x; \beta, \gamma_i),$$
where $I^{(i)}_{\beta\gamma}$ and $I^{(i)}_{\gamma\gamma}$ are the corresponding submatrices of the information matrix $I^{(i)}$ in the $i$-th site, i.e., $I^{(i)} = \mathbb{E}\{ -\nabla^2 \log f(X_{i1}; \beta^*, \gamma_i^*) \}$.
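The role of the efficient score can be illustrated numerically. The sketch below assumes a toy bivariate Gaussian model with known covariance (an illustrative choice, not this paper's setting): the first mean component plays the role of the parameter of interest, the second is the nuisance, and the defining orthogonality of the efficient score to the nuisance score is checked by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model (illustrative assumption): X ~ N((beta, gamma), Sigma), Sigma known.
beta, gamma = 1.0, -0.5
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

x = rng.multivariate_normal([beta, gamma], Sigma, size=200_000)
score = (x - [beta, gamma]) @ Sigma_inv   # per-observation score vector

# Information matrix I = Sigma^{-1}; pick out its blocks.
I_bb, I_bg = Sigma_inv[0, 0], Sigma_inv[0, 1]
I_gg = Sigma_inv[1, 1]

# Efficient score: S_eff = score_beta - I_bg * I_gg^{-1} * score_gamma
s_eff = score[:, 0] - (I_bg / I_gg) * score[:, 1]

# Orthogonality: the efficient score is uncorrelated with the nuisance score,
# so estimating gamma has no first-order effect on inference for beta.
print(np.mean(s_eff * score[:, 1]))   # approximately 0
```

In this Gaussian toy model the variance of the efficient score equals the partial information $I_{\beta\beta} - I_{\beta\gamma} I_{\gamma\gamma}^{-1} I_{\gamma\beta}$, the quantity that reappears in the asymptotic variance results of Section 4.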
Instead of constructing a surrogate likelihood function, we aim to construct a surrogate efficient score equation using individual-level data from Site 1, and summary-level data from the other sites. To reduce the complexity of the problem, we first consider an ideal situation where we know the true parameter values $(\beta^*, \gamma_1^*, \dots, \gamma_K^*)$. Using the key idea of the surrogate likelihood approach, we aim to construct a function $g(x; \beta)$ based on the samples in the first site such that
$$\mathbb{E}_1\{ g(X_{11}; \beta) \} = \frac{1}{K} \sum_{i=1}^{K} \mathbb{E}_i\{ S_i(X_{i1}; \beta, \gamma_i^*) \} \qquad (3.1)$$
holds for any $\beta$, where we use $\mathbb{E}_i$ to denote the expectation with respect to the distribution $f(x; \beta^*, \gamma_i^*)$, $\mathbb{E}$ to denote the expectation with respect to the joint distribution of the full data, and $\beta^*$ to denote the true value of $\beta$. Denote $\bar{S}(x; \beta) = \frac{1}{K} \sum_{i=1}^{K} S_i(x; \beta, \gamma_i^*)$. We observe that the right-hand side of equation (3.1) can be written as
$$\frac{1}{K} \sum_{i=1}^{K} \int S_i(x; \beta, \gamma_i^*) f(x; \beta^*, \gamma_i^*) \, dx.$$
However, the function $\bar{S}$ only involves samples in the first local site, which follow the distribution $f(x; \beta^*, \gamma_1^*)$, different from $f(x; \beta^*, \gamma_i^*)$ for $i \neq 1$. To achieve equation (3.1), we propose to construct $g$ by using the density ratio tilting method
$$g(x; \beta) = \frac{1}{K} \sum_{i=1}^{K} \frac{f(x; \beta^*, \gamma_i^*)}{f(x; \beta^*, \gamma_1^*)} \, S_i(x; \beta, \gamma_i^*),$$
where the density ratio $f(x; \beta^*, \gamma_i^*) / f(x; \beta^*, \gamma_1^*)$ is the adjustment that accounts for the heterogeneity of the distributions. We show in Lemma S.11 of the Supplementary Material that equation (3.1) holds for any $\beta$ and observation $X_{1j}$ in the first local site.
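The identity underlying the density ratio tilting is the importance-weighting identity $\mathbb{E}_{f_1}\{ (f_i/f_1)\, h(X) \} = \mathbb{E}_{f_i}\{ h(X) \}$: reweighting Site 1's samples by the density ratio recovers expectations under Site $i$'s distribution. A Monte Carlo sketch with two normal densities (an illustrative choice, not this paper's model):

```python
import numpy as np

rng = np.random.default_rng(3)

def npdf(x, mu, sd=1.0):
    # Normal density, written out to keep the sketch NumPy-only.
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Site 1 samples from f1 = N(0, 1); site i's density is f_i = N(1, 1).
x1 = rng.standard_normal(500_000)
mu1, mui = 0.0, 1.0

h = lambda x: x ** 2                      # any integrand h(x)
w = npdf(x1, mui) / npdf(x1, mu1)         # density ratio f_i / f_1

tilted = np.mean(w * h(x1))               # E_{f1}[ (f_i/f1) h(X) ]
direct = np.mean(h(mui + rng.standard_normal(500_000)))  # E_{f_i}[ h(X) ]
print(tilted, direct)                     # both approximate E[X^2] = 2
```

The tilted estimator uses only Site 1's samples, which is exactly why the construction above lets Site 1 mimic expectations under the other sites' distributions.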
Since $g$ depends on the unknown parameters $\beta^*$, $\gamma_1^*, \dots, \gamma_K^*$, and the information matrices $I^{(i)}$, we plug in some initial estimators $\bar{\beta}$ and $\bar{\gamma}_1, \dots, \bar{\gamma}_K$ and obtain
$$\bar{g}(x; \beta) = \frac{1}{K} \sum_{i=1}^{K} \frac{f(x; \bar{\beta}, \bar{\gamma}_i)}{f(x; \bar{\beta}, \bar{\gamma}_1)} \Big\{ \nabla_{\beta} \log f(x; \beta, \bar{\gamma}_i) - \bar{I}^{(i)}_{\beta\gamma} \big( \bar{I}^{(i)}_{\gamma\gamma} \big)^{-1} \nabla_{\gamma_i} \log f(x; \beta, \bar{\gamma}_i) \Big\},$$
where $\bar{I}^{(i)}_{\beta\gamma}$ and $\bar{I}^{(i)}_{\gamma\gamma}$ are the submatrices of $\bar{I}^{(i)}$ defined as
$$\bar{I}^{(i)} = \frac{1}{n} \sum_{j=1}^{n} \frac{f(X_{1j}; \bar{\beta}, \bar{\gamma}_i)}{f(X_{1j}; \bar{\beta}, \bar{\gamma}_1)} \big\{ -\nabla^2 \log f(X_{1j}; \bar{\beta}, \bar{\gamma}_i) \big\},$$
which is obtained by applying the density ratio tilting method again.
Denote $\bar{g}_n(\beta) = \frac{1}{n} \sum_{j=1}^{n} \bar{g}(X_{1j}; \beta)$. Inspired by the definition of the surrogate likelihood in equation (2.2), we define the surrogate efficient score function as
$$\widetilde{S}(\beta) = \bar{g}_n(\beta) + \frac{1}{K} \sum_{i=1}^{K} \bar{S}_i(\bar{\beta}) - \bar{g}_n(\bar{\beta}),$$
where $\bar{S}_i(\bar{\beta}) = \frac{1}{n} \sum_{j=1}^{n} S_i(X_{ij}; \bar{\beta}, \bar{\gamma}_i)$ and $\bar{g}_n(\bar{\beta}) = \frac{1}{n} \sum_{j=1}^{n} \bar{g}(X_{1j}; \bar{\beta})$.
Recall that $\widetilde{S}(\beta)$ is constructed based on the samples in Site 1. Thus the surrogate efficient score only requires each site to transfer a $p$-dimensional score vector $\bar{S}_i(\bar{\beta})$ together with some initial estimators.
The surrogate efficient score estimator $\hat{\beta}$ is obtained by solving the following equation for $\beta$ within Site 1,
$$\widetilde{S}(\beta) = 0. \qquad (3.2)$$
In Section 4, we show that the estimation accuracy of the above estimator can be further improved by iterating the above surrogate efficient score procedure. The method is summarized in the following algorithm. The estimator defined in equation (3.2) is equivalent to the estimator with $T = 1$ in the following algorithm, which is also known as a one-shot procedure.
From an implementation point of view, the broadcast step (line 3) in the above algorithm can be done by transferring the local estimators from each site to Site 1, with Site 1 returning the initial estimator $\bar{\beta}$ to all the sites. It can also be done by uploading all local estimators to a shared repository and computing $\bar{\beta}$ at each site. The initial estimator $\bar{\beta}$ is chosen as a weighted average of the local estimators $\bar{\beta}_i$, i.e., $\bar{\beta} = \sum_{i=1}^{K} w_i \bar{\beta}_i / \sum_{i=1}^{K} w_i$. When $w_i = 1/K$ for all $i$, $\bar{\beta}$ is the simple average estimator (Zhao et al., 2016). We can also choose $w_i$ to be the sample size of each site in an unbalanced design. When $w_i$ is chosen as the inverse of the estimated variance of $\bar{\beta}_i$, the resulting estimator is referred to as the fixed effect meta-analysis estimator. In this paper, we simply choose $w_i = 1/K$. The total communication cost per iteration is to transfer $O(Kp)$ numbers across all sites, where $p$ is the dimension of $\beta$. Compared to the homogeneous setting, the communication cost is of the same order as that of Jordan et al. (2018), and the algorithm is therefore communication-efficient.
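The three weighting choices for the initial estimator described above can be sketched in a few lines; the helper function and the numerical values below are illustrative assumptions.

```python
import numpy as np

def initial_estimate(est, weights=None):
    # est: length-K array of local estimates; weights: length-K array or None.
    est = np.asarray(est, dtype=float)
    if weights is None:
        return est.mean()                 # w_i = 1/K: simple average
    w = np.asarray(weights, dtype=float)
    return np.sum(w * est) / np.sum(w)    # normalized weighted average

est = [1.1, 0.9, 1.3]                     # hypothetical local estimates
sizes = np.array([100, 400, 100])         # unbalanced design
variances = np.array([0.04, 0.01, 0.04])  # estimated variances of each est_i

print(initial_estimate(est))                         # simple average: 1.1
print(initial_estimate(est, weights=sizes))          # sample-size weights: 1.0
print(initial_estimate(est, weights=1 / variances))  # fixed-effect meta: 1.0
```

With these numbers, both data-adaptive weightings pull the combined estimate toward the larger, more precise second site, which is the intended behavior in an unbalanced design.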
To further reduce the computational complexity of solving the surrogate efficient score equation, we can approximate the combined efficient score function via a one-step Taylor expansion,
First, the property of the efficient score implies that its expected derivative with respect to the nuisance parameters vanishes, so that the last term can be neglected. Next, we replace the Hessian matrix computed by pooling over all the samples with a local surrogate computed in Site 1. The resulting linear approximation, viewed as an estimating function of $\beta$, defines the following estimator
If we treat $\bar{\beta}$ as an initial estimator, the above estimator can also be viewed as a one-step estimator with a local surrogate of the Hessian matrix.
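A one-step estimator of this type can be sketched on a toy one-parameter model. The Bernoulli example and the use of the expected local Hessian below are illustrative assumptions, not this paper's surrogate construction; the point is only the one-step Newton update from a crude initial value.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy 1-D illustration: X ~ Bernoulli(logit^{-1}(beta)).
beta_star = 0.8
x = rng.binomial(1, 1 / (1 + np.exp(-beta_star)), size=5_000)

def score(beta):
    p = 1 / (1 + np.exp(-beta))
    return np.mean(x - p)           # mean score function

def hessian(beta):
    p = 1 / (1 + np.exp(-beta))
    return -p * (1 - p)             # expected Hessian of the log-likelihood

beta_bar = 0.0                      # crude but fixed initial estimator

# One-step update: beta_hat = beta_bar - H(beta_bar)^{-1} * S(beta_bar)
beta_one_step = beta_bar - score(beta_bar) / hessian(beta_bar)

# The exact MLE solves mean(x) = logit^{-1}(beta):
beta_mle = np.log(x.mean() / (1 - x.mean()))
print(beta_one_step, beta_mle)
```

A single Newton step already moves the crude initial value most of the way to the maximum likelihood estimator, which is why the one-step variant can stand in for the full root-finding at much lower cost.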
4 Main Results
In this section, we study the theoretical properties of the surrogate efficient score estimator obtained from Algorithm 1. For convenience, we use $C$ and $C'$ to denote positive constants which can vary from place to place. For sequences $a_n$ and $b_n$, we write $a_n \lesssim b_n$ ($a_n \gtrsim b_n$) if there exists a constant $C$ such that $a_n \le C b_n$ ($a_n \ge C b_n$) for all $n$. We first introduce the following assumptions.
Assumption 1 (Parameter Space).
The parameter space of $(\beta, \gamma_i)$, denoted by $\Theta$, is a compact and convex subset of $\mathbb{R}^{p+q}$. The true value $(\beta^*, \gamma_i^*)$ is an interior point of $\Theta$ for every $1 \le i \le K$.
Assumption 2 (Local Strong Convexity).
Define the expected second-order derivative of the negative log-likelihood function to be $I^{(i)}(\beta, \gamma_i) = \mathbb{E}\{ -\nabla^2 \log f(X_{i1}; \beta, \gamma_i) \}$. There exist positive constants $\mu_-$ and $\mu_+$ such that for any $1 \le i \le K$, the population Hessian matrix satisfies
$$\mu_- I_{p+q} \preceq I^{(i)}(\beta^*, \gamma_i^*) \preceq \mu_+ I_{p+q},$$
where $I_{p+q}$ is the $(p+q)$-dimensional identity matrix. Here, we use the notation $A \preceq B$ for two matrices $A$ and $B$ if $B - A$ is positive semi-definite.
Assumption 3.
For all $1 \le i \le K$ and $(\beta, \gamma_i) \in \Theta$, all components in $\nabla \log f(X_{i1}; \beta, \gamma_i)$ and $\nabla^2 \log f(X_{i1}; \beta, \gamma_i)$ are sub-exponential random variables.
Assumption 4 (Identifiability).
For any $1 \le i \le K$, we denote $m_i(\beta, \gamma_i) = \mathbb{E}\{ \log f(X_{i1}; \beta, \gamma_i) \}$. The parameter $(\beta^*, \gamma_i^*)$ is the unique maximizer of $m_i(\beta, \gamma_i)$.
Assumption 5 (Smoothness).
For each $1 \le i \le K$, let $\theta_i = (\beta^{\mathrm{T}}, \gamma_i^{\mathrm{T}})^{\mathrm{T}}$, and define $\theta_i^* = (\beta^{*\mathrm{T}}, \gamma_i^{*\mathrm{T}})^{\mathrm{T}}$. Define $B_i(\rho) = \{ \theta : \| \theta - \theta_i^* \|_2 \le \rho \}$ for some radius $\rho > 0$. There exist some functions $M_1(x)$ and $M_2(x)$, where $M_1(X_{i1})$ and $M_2(X_{i1})$ are sub-exponentially distributed for all $1 \le i \le K$, such that for any $x$ and $\theta, \theta' \in B_i(\rho)$, we have
$$\big\| \nabla^2 \log f(x; \theta) - \nabla^2 \log f(x; \theta') \big\|_2 \le M_1(x) \, \| \theta - \theta' \|_2.$$
And for any $1 \le i \le K$, $x$, $\theta \in B_i(\rho)$, and $\theta' \in B_1(\rho)$, we have
$$\Big| \frac{f(x; \theta)}{f(x; \theta')} - \frac{f(x; \theta_i^*)}{f(x; \theta_1^*)} \Big| \le M_2(x) \big( \| \theta - \theta_i^* \|_2 + \| \theta' - \theta_1^* \|_2 \big).$$
Assumptions 1, 2, 4, and 5 are standard assumptions in the distributed inference literature; see Jordan et al. (2018). Assumption 3 is a general distributional requirement on the data, which covers a wide range of parametric models.
When all individual-level data can be pooled together, the global maximum likelihood estimator $\hat{\beta}_{\mathrm{MLE}}$ is considered the gold standard in practice. The asymptotic properties of the global estimator have been studied in Li et al. (2003) under the asymptotic regime where both $n$ and $K$ diverge. Our first result characterizes a non-asymptotic bound for the distance between the global maximum likelihood estimator and the true parameter value $\beta^*$.
Under Assumptions 1-5, the global maximum likelihood estimator satisfies
$$\| \hat{\beta}_{\mathrm{MLE}} - \beta^* \|_2 \le C \Big( \frac{1}{\sqrt{nK}} + \frac{1}{n} \Big)$$
with probability at least $1 - C' (nK)^{-1}$, for some positive constants $C$ and $C'$ not related to $n$ and $K$.
To the best of our knowledge, this is one of the first nonasymptotic results on the rate of convergence of the maximum likelihood estimator in the presence of site-specific nuisance parameters. Under the classical two-index asymptotic setting, we allow the number of sites $K$ to grow with the sample size $n$. As a result, the dimension of the nuisance parameters, $Kq$, also increases with $n$. Let $N = nK$ denote the total sample size. This lemma implies that the convergence rate of $\hat{\beta}_{\mathrm{MLE}}$ is of order $N^{-1/2}$ when $K \lesssim n$, which attains the optimal rate of convergence with known nuisance parameters. However, when $K \gtrsim n$, the estimator has a slower rate $n^{-1}$. In particular, if $n$ is fixed, the global maximum likelihood estimator is no longer consistent, which is known as the Neyman-Scott problem (Neyman and Scott, 1948).
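The Neyman-Scott phenomenon referenced above can be reproduced in a few lines: with a fixed number of observations per site and a site-specific nuisance mean, the maximum likelihood estimator of the common variance converges to the wrong limit. The Gaussian-pairs setup below is the textbook example, used here purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Neyman-Scott: K sites, fixed n = 2 observations per site, site-specific
# nuisance means mu_i and a common variance sigma^2 of interest.
K, sigma2 = 100_000, 1.0
mu = rng.uniform(-5, 5, size=K)
x = mu[:, None] + np.sqrt(sigma2) * rng.standard_normal((K, 2))

# Profile out each mu_i by the site mean; the MLE of sigma^2 is then the
# average squared deviation, which converges to sigma^2 * (n-1)/n = 0.5.
resid = x - x.mean(axis=1, keepdims=True)
sigma2_mle = np.mean(resid ** 2)
print(sigma2_mle)   # close to 0.5, not 1.0: MLE inconsistent for fixed n
```

Even though the total sample size $N = 2K$ grows without bound, the nuisance dimension grows just as fast, so the pooled MLE never recovers the true variance; this is the regime where the rate results above degrade.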
In the following, we characterize the difference between the proposed estimator $\hat{\beta}$ and the maximum likelihood estimator $\hat{\beta}_{\mathrm{MLE}}$. We first focus on the estimator defined in equation (3.2), which is identical to the estimator in Algorithm 1 with the number of iterations $T = 1$.
Suppose Assumptions 1-5 hold. In Algorithm 1, if the number of iterations $T = 1$, then under a mild scaling condition on $n$ and $K$, we have
where $C$ is a positive constant not related to $n$ and $K$.
The above theorem shows that the proposed estimator with only one iteration converges to the global estimator at a rate depending only on $n$. Together with Lemma 1, we obtain $\| \hat{\beta} - \beta^* \|_2 = O_P( N^{-1/2} + n^{-1} )$. In other words, the estimator has the same rate of convergence as the global maximum likelihood estimator.
Under the same scaling of $n$ and $K$, we showed in Lemma S.10 of the Supplementary Material that the simple average estimator $\bar{\beta}$ defined in Algorithm 1 satisfies an analogous bound. By comparing it with the bound in Theorem 1, we have
Thus our estimator is closer to the global maximum likelihood estimator than the simple average estimator under this condition.
Our next result shows that the estimator in Algorithm 1 with at least one additional iteration, i.e., $T \ge 2$, has a tighter bound than that in Theorem 1. This explains why we advocate the iterative Algorithm 1 in Section 3.
Suppose all the assumptions in Theorem 1 hold. In Algorithm 1, if the number of iterations $T \ge 2$, we have
where $C$ and $C'$ are positive constants not related to $n$ and $K$.
The above theorem implies that, for any $T \ge 2$ under the same scaling of $n$ and $K$,
which improves the result in equation (4.1). When $n$ is relatively small, our estimator with $T \ge 2$ is substantially closer to the global maximum likelihood estimator than the simple average estimator. We also see an interesting fact that the dimension $q$ of the nuisance parameters has no effect on the relative error between the proposed estimator and $\hat{\beta}_{\mathrm{MLE}}$. This dimension-free phenomenon provides an explanation of why the proposed estimator consistently outperforms the simple average method in our simulation studies in Section 6.
Our next theorem establishes the asymptotic normality of the proposed estimator.
Suppose all the assumptions in Theorem 1 hold. Define $\bar{I}_{\beta|\gamma} = \frac{1}{K} \sum_{i=1}^{K} I^{(i)}_{\beta|\gamma}$, where $I^{(i)}_{\beta|\gamma}$ is the partial information matrix of $\beta$ in the $i$-th site defined as $I^{(i)}_{\beta|\gamma} = I^{(i)}_{\beta\beta} - I^{(i)}_{\beta\gamma} ( I^{(i)}_{\gamma\gamma} )^{-1} I^{(i)}_{\gamma\beta}$. Assuming $K = O(n^{1-\delta})$ for some fixed $\delta > 0$, we have for any unit vector $v \in \mathbb{R}^p$, as $n, K \to \infty$,
To obtain the $\sqrt{N}$-asymptotic normality of the proposed estimator $\hat{\beta}$, we have to restrict to the setting $K = O(n^{1-\delta})$ for some $\delta > 0$. In particular, when $K \asymp n$, or equivalently $N \asymp n^2$, Li et al. (2003) showed that the maximum likelihood estimator is asymptotically biased, that is, $\sqrt{N}(\hat{\beta}_{\mathrm{MLE}} - \beta^*)$ has a limiting distribution with a nonzero mean. Since the proposed estimator satisfies $\| \hat{\beta} - \hat{\beta}_{\mathrm{MLE}} \|_2 = o_P(N^{-1/2})$ by Theorem 2, the same limiting distribution also holds for $\hat{\beta}$ in that regime. This leads to a phase transition of the limiting distribution of $\hat{\beta}$ at $K \asymp n$. As a result, the condition $K = O(n^{1-\delta})$ is essential for the asymptotic unbiasedness of $\hat{\beta}$ and cannot be further relaxed.
The choice of the initial value $\bar{\beta}$ in line 3 of Algorithm 1 is not necessarily restricted to the local maximum likelihood estimators. Due to the use of the efficient score, the impact of the initial estimators of the nuisance parameters is alleviated. We can show that the conclusions of Theorems 1-3 still hold if $\bar{\beta}$ is replaced with any $\sqrt{n}$-consistent estimator.
It is well known that the simple average estimator is fully efficient under the homogeneous setting; see e.g., Battey et al. (2018) and Jordan et al. (2018). However, the following proposition shows that this estimator is no longer efficient under the considered heterogeneous setting.
Recall that the simple average estimator is $\bar{\beta} = \frac{1}{K} \sum_{i=1}^{K} \bar{\beta}_i$, where $\bar{\beta}_i$ is the local maximum likelihood estimator of $\beta$ in the $i$-th site. Suppose all the conditions in Theorem 3 hold. We have, as $n, K \to \infty$,
In this remark, we compare the asymptotic variances of $\hat{\beta}$ in Theorem 3 and $\bar{\beta}$ in Proposition 1. Our proposed estimator is efficient in the sense that its asymptotic variance equals the Cramér-Rao lower bound, i.e., $v^{\mathrm{T}} \bar{I}_{\beta|\gamma}^{-1} v$, for any unit vector $v$. On the other hand, the simple average estimator is not efficient, as its asymptotic variance is no smaller, because $\frac{1}{K} \sum_{i=1}^{K} ( I^{(i)}_{\beta|\gamma} )^{-1} \succeq \bar{I}_{\beta|\gamma}^{-1}$ by the convexity of the matrix inverse.
Finally, to construct a confidence interval for $\beta$, we need to provide a consistent estimator for the averaged partial information matrix $\bar{I}_{\beta|\gamma}$. In the following theorem, we apply the density ratio tilting approach to estimate the variance using data only from the first local site.
Suppose all the assumptions in Theorem 3 hold. Define
We have, as $n, K \to \infty$,
5 Reducing the Influence of the Local Site
In a practical collaborative research network, each site can act as the local site and obtain an estimate using Algorithm 1. To further reduce the impact of the choice of the local site and improve the stability of the algorithm, we can combine the estimates obtained from Algorithm 1 by a simple average. At the $k$-th site, we define the site-specific surrogate score function $\widetilde{S}_k(\beta)$ to be
The surrogate score function $\widetilde{S}_k(\beta)$ is obtained using the individual-level data in the $k$-th site and summary-level data from the other sites. In this case, each site can obtain a surrogate efficient score estimator $\hat{\beta}_k$ by solving $\widetilde{S}_k(\beta) = 0$, and we further combine these estimators by $\hat{\beta}_{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^{K} \hat{\beta}_k$. The variance of $\hat{\beta}_{\mathrm{avg}}$ can be estimated accordingly. The algorithm is summarized below.
6 Simulation Study
We consider a logistic regression between a binary outcome and a binary exposure