# Heterogeneity-aware and communication-efficient distributed statistical inference

In multicenter research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed. The existing distributed algorithms usually assume the data are homogeneously distributed across sites. This assumption ignores the important fact that the data collected at different sites may come from various sub-populations and environments, which can lead to heterogeneity in the distribution of the data. Ignoring the heterogeneity may lead to erroneous statistical inference. In this paper, we propose distributed algorithms which account for the heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. The proposed algorithms maintain same communication cost as the existing communication-efficient algorithms. We establish the non-asymptotic risk bound of the proposed distributed estimator and its limiting distribution in the two-index asymptotic setting. In addition, we show that the asymptotic variance of the estimator attains the Cramér-Rao lower bound. Finally, the simulation study shows the proposed algorithms reach higher estimation accuracy compared to several existing methods.

## Authors

• 4 publications
• 24 publications
• 37 publications
• ### Communication-Efficient Distributed Statistical Inference

We present a Communication-efficient Surrogate Likelihood (CSL) framewor...
05/25/2016 ∙ by Michael I. Jordan, et al. ∙ 0

• ### Communication-Efficient Distributed Estimator for Generalized Linear Models with a Diverging Number of Covariates

Distributed statistical inference has recently attracted immense attenti...
01/17/2020 ∙ by Ping Zhou, et al. ∙ 0

• ### Distributed Statistical Inference for Massive Data

This paper considers distributed statistical inference for general symme...
05/29/2018 ∙ by Song Xi Chen, et al. ∙ 0

• ### A Distributed One-Step Estimator

Distributed statistical inference has recently attracted enormous attent...
11/04/2015 ∙ by Cheng Huang, et al. ∙ 0

• ### Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity

Originated from distributed learning, federated learning enables privacy...
05/25/2020 ∙ by Cong Wang, et al. ∙ 5

• ### Fast communication-efficient spectral clustering over distributed data

The last decades have seen a surge of interests in distributed computing...
05/05/2019 ∙ by Donghui Yan, et al. ∙ 0

• ### M-estimation in a diffusion model with application to biosensor transdermal blood alcohol monitoring

With the goal of well-founded statistical inference on an individual's b...
02/13/2020 ∙ by Maria Allayioti, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The growth of availability and variety of clinical data has induced the trend of multicenter research (Sidransky et al., 2009). Multicenter research confers many distinct advantages over single-center studies, including the ability to study rare exposures/outcomes that require larger sample sizes, accelerating the discovery of more generalizable findings, and bringing together investigators who share and leverage resources, expertise, and ideas (Cheng et al., 2017). Since individual-level information is often protected by privacy regularities and rules, directly pooling data across multiple clinical sites is less feasible or requires large amount of operational efforts (Barrows Jr and Clayton, 1996). As a consequence, healthcare systems need more effective tools for evidence synthesis across clinical sites.

Distributed algorithms, also known as the divide-and-conquer procedures, have been applied to multicenter studies. In the classical divide-and-conquer framework, the entire data set is split into multiple subsets and the final estimator is obtained by averaging the local estimators computed using the data from each subset (Li et al., 2013; Chen and Xie, 2014; Lee et al., 2017; Tian and Gu, 2016; Zhao et al., 2016; Lian and Fan, 2017; Battey et al., 2018; Wang et al., 2019). The class of methods adopts the same principle as the meta-analysis in the area of evidence synthesis and systematic review, where the local estimates are combined through a fixed effect or random effects model (DerSimonian and Laird, 1986). When the number of research sites is relatively small, these averaging type of methods are able to perform equally well as the combined analysis using data from all the sites (Hedges, 1983; Olkin and Sampson, 1998; Battey et al., 2018). When the number of research sites is large, as we will demonstrate in the simulation studies, these averaging methods may not be as good as the combined analysis. More importantly, when studying rare conditions, some clinical sites do not have enough number of cases to achieve the asymptotic properties. In such cases, the averaging methods can be suboptimal.

Recently, Wang et al. (2017) and Jordan et al. (2018) proposed a novel surrogate likelihood approach, which approximates the higher order derivatives of the global likelihood by using the likelihood function in a local site. This method has low communication cost and improves the performance of the average method especially when the number of sites is large, see Duan et al. (2019) for a real data application to pharamcoepidemiology. From the practical perspective, the surrogate likelihood approach endowed a highly feasible framework for sharing sensitive data in collaborative environment, especially in biomedical sciences, where the lead investigators often have access to the individual-level data in their home institute, and the collaborative investigators from other sites are willing to share summary statistics but not individual-level information.

Most of the aforementioned distributed algorithms assumed that the data at different sites are independently and identically distributed. However, a prominent concern in multi-center analysis is that there may exist a non-negligible degree of heterogeneity across sites because the samples collected in different sites may come from different sub-populations and environments. One concrete example is the Observational Health Data Sciences and Informatics consortium, which contains over 82 clinical databases from over 20 countries around the world (Hripcsak et al., 2015). The amount of heterogeneity cannot be ignored when implementing distributed algorithms in such healthcare networks.

In this paper, we aim to fill this methodological gap by extending the distributed algorithms to account for the heterogeneous distributions via allowing site-specific nuisance parameters. In particular, we propose a density-ratio tilted surrogate efficient score approach which only requires the individual-level data from a local site and summary statistics from the other sites. To reduce the influence of estimation of the site-specific nuisance parameters, we propose to use the efficient score function for distributed inference rather than the likelihood as in Jordan et al. (2018). We further adjust for the degree of heterogeneity by applying a novel density ratio tilting method to the efficient score function. We refer the resulting score function as the surrogate efficient score function. The estimator is defined as the root of this function. We show that the communication cost of the proposed algorithm is of the same order as Jordan et al. (2018) assuming no heterogeneity and therefore is communication-efficient. We further establish the nonasymptotic risk bound of the estimator and its limiting distribution in the two-index asymptotic setting. Under mild conditions, the asymptotic variance of the proposed estimator attains the Cramér-Rao lower bound.

To the best of our knowledge, Zhao et al. (2016) is the only work in this area that considers a similar heterogeneous setting. They generalized the divide-and-conquer approach by averaging all the local estimators and studied theoretical properties under the partially linear model. Different from this work, we propose a surrogate efficient score approach under a general parametric likelihood framework. Theoretically, we show that our estimator approximates the global maximum likelihood estimator with a faster rate than the simple average estimator; see Remarks 3 and 4. From the inference perspective, our estimator attains the Cramér-Rao lower bound whereas the simple average estimator has larger asymptotic variance and is not efficient; see Remark 6. We also show that the proposed estimator outperforms the simple average estimator in numerical studies.

## 2 Wang et al. (2017) and Jordan et al. (2018)’s surrogate likelihood approach for homogenous distributions

In this section, we briefly review the surrogate likelihood approach for distributed inference by Wang et al. (2017) and Jordan et al. (2018)

. Consider a general parametric likelihood framework, where the random variable

follows the density function indexed by a finite dimensional unknown parameter . In the distributed inference problem, we suppose there are different local sites. Denote to be the -th observation in the -th site. For notational simplicity, we assume that each site has equal sample size . The existing works on distributed inference such as Wang et al. (2017) and Jordan et al. (2018) further assume that all the observations are independently and identically distributed across sites, . Under this assumption, the combined log likelihood function can be written as

 L(θ)=1KnK∑j=1n∑i=1logf(yij;θ):=1KK∑j=1Lj(θ),

where is the log-likelihood function obtained at each site. Due to the communication constraint and privacy concerns, one cannot directly combine data across multiple sites to compute the maximum likelihood estimator. Motivated by the following Taylor expansion of the combined likelihood function around some initial value ,

 L(θ)=L(¯θ)+∇L(¯θ)Tθ−¯θ)+∞∑k=21k!∇kL(¯θ)(θ−¯θ)⊗k, (2.1)

Wang et al. (2017) and Jordan et al. (2018) proposed to construct a surrogate likelihood function by approximating all the higher-order derivatives in equation (2.1) using the individual-level data in one of the sites (such as the first site). When the data are identically and independently distributed across sites, it holds that for any , where is the log-likelihood at the first site. Thus, is an asymptotically unbiased surrogate of . Replacing with , the communication of the higher-order derivatives across sites can be avoided. Hence, by replacing with , which also equals to and dropping the terms independent of , the surrogate likelihood is defined as

 ~L(θ):=L1(θ)+{∇L(¯θ)−∇L1(¯θ)}T. (2.2)

The theoretical properties of the maximum surrogate likelihood estimator have been thoroughly studied; see Wang et al. (2017) and Jordan et al. (2018) for details.

## 3 Surrogate efficient score method for heterogeneous distributions

We consider a heterogeneous setting by assuming the -th observation in the -th site satisfies

 Yij∼f(y;θj),for i∈{1,…,n} and j∈{1,…,K},

where the unknown parameter can be decomposed into . In this partition, is a -dimensional parameter of interest assumed to be common in every site, which is the main motivation of evidence synthesis, and the -dimensional nuisance parameter is allowed to be different across sites. The true value of is denoted by .

If all patient-level data could be pooled together, the combined log-likelihood function is obtained by

 LN(β,Γ)=1KnK∑j=1n∑i=1logf(yij;β,γj):=1KK∑j=1Lj(θj),

where and . In a distributed setting, the surrogate likelihood approach reviewed in Section 2 is not directly applicable due to the following two reasons: the higher order derivatives of the log likelihood function in any site is a biased surrogate of the corresponding higher order derivatives of , and the total number of nuisance parameters increases with sample size if we allow to increase with .

To extend the surrogate likelihood approach to incorporate the site-specific nuisance parameters, we propose to use the efficient score function as a way of reducing the influence of the less accurate estimation of the site-specific . More specifically, the efficient score function for is defined as

 S(β,Γ)=1KnK∑j=1n∑i=1{∇βlogf(yij;β,γj)−I(j)βγI(j)γγ−1∇γlogf(yij;β,γj)},

where and are the corresponding submatrices of the information matrix in the -th site, i.e., .

Instead of constructing a surrogate likelihood function, we aim to construct a surrogate efficient score equation using individual-level data from Site 1, and summary-level data from the other sites. To reduce the complexity of the problem, we first consider an ideal situation where we know the true parameter value . Using the key idea of the surrogate likelihood approach, we aim to construct a function based on the samples in the first site such that

 Eθ∗1{∇kβg∗(Yi1;β)}=E{∇kβS(β,Γ∗)}, (3.1)

holds for any , where we use to denote the expectation with respect to the distribution ,

to denote the expectation with respect to the the joint distribution of the full data, and

to denote the true value of . Denote . We observe that the right hand side of equation (3.1) can be written as

 E{∇kβS(β,Γ∗)}=1KK∑j=1Eθ∗j{∇kβsj(Yij;β,γ∗j)}.

However, the function only involves samples in the first local site, which follows the distribution different from for . To achieve equation (3.1), we propose to construct by using the density ratio tilting method

 g∗(y;β)=1KK∑j=1[f(y;β∗,γ∗j)f(y;β∗,γ∗1){∇βlogf(y;β,γ∗j)−I(j)γβI(j)γγ−1∇γlogf(y;β,γ∗j)}],

where the density ratio is the adjustment that accounts for the heterogeneity of the distributions. We show in Lemma S.11 of the Supplementary Material that holds for any and observation in the first local site.

Since depends on the unknown parameters , and the information matrix , we plug in some initial estimators and and obtain

 g(y;β,¯β,¯Γ)=1KK∑j=1[f(y;¯β,¯γj)f(y;¯β,¯γ1){∇βlogf(y;β,¯γj)−~H(1,j)βγ{~H(1,j)γγ}−1∇γlogf(y;β,¯γj)}],

where and are the submatrices of defined as

 ~H(1,j)=−1nn∑i=1∇2logf(yi1;¯β,¯γj)f(yi1;¯β,¯γj)f(yi1;¯β,¯γ1),

which is obtained by applying the density ratio tilting method again.

Denote . Inspired by the definition of the surrogate likelihood defined in equation (2.2), we define the surrogate efficient score function as

 ~U(β;¯β,¯Γ)=U1(β;¯β,¯Γ)+1KK∑j=1{∇βLj(¯β,¯γj)−¯H(j)βγ(¯Hγγ(j))−1∇γLj(¯β,¯γj)}−U1(¯β;¯β,¯Γ),

where and .

Recall that is constructed based on the samples in Site 1. Thus the surrogate efficient score only requires to transfer a

-dimensional score vector

from each site together with some initial estimators.

The surrogate efficient score estimator is obtained by solving the following equation for within Site 1,

 ~U(β;¯β,¯Γ)=0. (3.2)

In Section 4, we show that the estimation accuracy of the above estimator can be further improved by iterating the above surrogate efficient score procedures. The method is summarized in the following algorithm. The estimator defined in equation (3.2) is equivalent to the estimator with in the following algorithm, which is also known as a oneshot procedure.

###### Remark 1.

From an implementation point of view, the broadcast step (line 3) in the above algorithm can be done by transferring from each site to Site 1, and Site 1 returns the initial estimator to all the sites. It can also be done by uploading all to a shared repository and obtaining at each site. The initial estimator is chosen as a weighted average of the local estimators . When for all , is the simple average estimator (Zhao et al., 2016). We can also choose to be the sample size of each site in the unbalanced design. When is chosen as the inverse of the estimated variance of , the resulting estimator is referred to as the fixed effect meta-analysis estimator. In this paper, we simply choose . The total communication cost per iteration is to transfer numbers across all sites, where is the dimension of . Comparing to the homogeneous setting, the communication cost is of the same order as Jordan et al. (2018), and is communication-efficient.

###### Remark 2.

To further reduce the computational complexity of solving the surrogate efficient score function, we can approximate the combined efficient score function via one-step Taylor expansion,

 S(β,Γ)≈S(¯β,¯Γ)+∇βS(¯β,¯Γ)(β−¯β)+∇ΓS(¯β,¯Γ)(Γ−¯Γ).

First, the property of the efficient score implies so that the last term can be neglected. Next, we replace the Hessian matrix computed by pooling over all the samples with the local surrogate , where . The resulting linear approximation as an estimating function of defines the following estimator

 ~βO=¯β−{∇βˇU1(¯β)}−1S(¯β,¯Γ).

If we treat as an initial estimator, the above estimator can be also viewed as a one-step estimator with a local surrogate of the Hessian matrix.

## 4 Main Results

In this section, we study the theoretical properties of the surrogate efficient score estimator obtained from Algorithm 1. For convenience, we use and to denote positive constants which can vary from place to place. For sequence and , we write () if there exist a constant such that () for all . We first introduce the following assumptions.

###### Assumption 1.

The parameter space of , denoted by , is a compact and convex subset of . The true value is an interior point of .

###### Assumption 2 (Local Strong Convexity).

Define the expected second-order derivative of the negative log likelihood function to be . There exist positive constants , such that for any , the population Hessian matrix satisfies

 μ−Id⪯I(j)(θ∗j)⪯μ+Id,

where is the

dimensional identity matrix. Here, we use the notation that

for two matrices and if is positive semi-definite.

###### Assumption 3.

For all and , all components in and are sub-exponential random variables.

###### Assumption 4 (Identifiability).

For any , we denote . The parameter is the unique maximizer of .

###### Assumption 5 (Smoothness).

For each , let , and define

 H(θj;y)=∇2logf(y;β,γj),

and

 ~H(β,¯ηj;y)=∇2logf(y;β,¯γj)f(y;¯β,¯γj)f(y;¯β,¯γ1).

Define for some radius . There exist some function and , where and

are sub-exponentially distributed for all

and , such that for any and , we have

 ∥H(θj;y)−H(θ′j;y)∥2≤m1(y)∥θ−θ′j∥2.

And for any , , , , we have

 ∥~H(β,¯ηj;y)−~H(β′,¯η′j;y)∥2≤m2(y){∥β−β′∥2+∥¯ηj−¯η′j∥2}.

Assumptions 1, 2, 4, and 5 are standard assumptions in the distributed inference literature; see Jordan et al. (2018)

. Assumption 3 is a general distributional requirements of the data, which covers a wide range of parametric models.

When all individual-level data can be pooled together, the global maximum likelihood estimator is considered as the gold standard in practice. The asymptotic property of the global estimator has been studied in Li et al. (2003) under the asymptotic regime . Our first result characterizes a non-asymptotic bound for the distance between the global maximum likelihood estimator and the true parameter value .

###### Lemma 1.

Under Assumptions 1-5, the global maximum likelihood estimator satisfies

 E∥^β−β∗∥2≤C1(Kn)1/2+C2n

for some positive constants and not related to and .

To the best of our knowledge, this is one of the first nonasymptotic results on the rate of convergence of the maximum likelihood estimator in the presence of site-specific nuisance parameters. Under the classical two-index asymptotics setting, we allow the number of sites to grow with the sample size . As a result, the dimension of nuisance parameters also increases with . Let denote the total sample size. This lemma implies that the convergence rate of is of order when , which attains the optimal rate of convergence with known nuisance parameters. However, when , the estimator has a slower rate . In particular if is fixed, the global maximum likelihood estimator is no longer consistent, which is known as the Neyman-Scott problem (Neyman et al., 1948).

In the following, we characterize the difference between the proposed estimator and the maximum likelihood estimator . We first focus on the estimator defined in equation (3.2), which is identical to in algorithm 1 with the number of iterations .

###### Theorem 1.

Suppose Assumptions 1-5 hold. In Algorithm 1, if the number of iterations , assuming , we have

 E∥~β(1)−^β∥2≤ Cn.

where is a positive constant not related to and .

The above theorem shows that the proposed estimator with only one iteration converges to the global estimator with a rate only depending on . Together with Lemma 1, we obtain . In other words, the estimator has the same rate of convergence as the global maximum likelihood estimator.

###### Remark 3.

When , we showed in Lemma S.10 of the Supplementary material that the simple average estimator defined in algorithm 1 satisfies . By comparing with the bound in Theorem 1, we have

 (4.1)

Thus our estimator is closer to the global maximum likelihood estimator than the simple average estimator under the condition .

Our next result shows that after at least one iteration the estimator in algorithm 1 with has a tighter bound than in Theorem 1. This explains why we advocate the iterative algorithm 1 in Section 3.

###### Theorem 2.

Suppose all the assumptions in Theorem 1 hold. In Algorithm 1, if the number of iterations , we have

 E∥~β(T)−^β∥2≤ C1(K)1/2n+C2n3/2,

where and are positive constants not related to and .

###### Remark 4.

The above theorem implies that when , for any

 E∥~β(T)−^β∥2≤Cn−1/2E∥¯β(1)−^β∥2, (4.2)

which improves the result in equation (4.1). When is relatively small, our estimator with is closer to the global maximum likelihood estimator by a factor of than the simple average estimator. We also see an interesting fact that the dimension of the nuisance parameters has no effect on the relative error . This dimension-free phenomenon provides an explanation of why the proposed estimator consistently outperforms the simple average method in our simulation studies in Section 6.

Our next theorem establish the asymptotic normality of the proposed estimator.

###### Theorem 3.

Suppose all the assumptions in Theorem 1 hold. Define where is the partial information matrix of defined as . Assuming for some fixed , we have for any , as ,

 Kn(~β(T)−β∗)Tβ|γ(~β(T)−β∗)→χ2p.

To obtain the -asymptotic normality of the proposed estimator , we have to restrict to the setting for some . In particular, when or equivalently , Li et al. (2003) showed that the maximum likelihood estimator is asymptotically biased, that is , for some . Since the proposed estimator (with ) satisfies by Theorem 2, it implies that the same asymptotic distribution holds for , for the same if . The same limiting distribution also holds for

. This leads to a phase transition of the limiting distribution of

at . As a result, the condition is essential for the asymptotic unbiasedness of and cannot be further relaxed.

###### Remark 5.

The choice of the initial value in line 3 of Algorithm 1 is not necessarily restricted to the local maximum likelihood estimator. Due to the use of the efficient score, the impact of the initial estimators of the nuisance parameters is alleviated. We can show that the conclusions of Theorem 1-3 still hold if is replaced with any -consistent estimator.

It is well known that the simple average estimator is fully efficient under the homogeneous setting; see e.g., Battey et al. (2018) and Jordan et al. (2018). However, the following proposition shows that this estimator is no longer efficient under the considered heterogeneous setting.

###### Proposition 1.

Recall that the simple average estimator is , where . Suppose all the conditions in Theorem 3 hold. We have as ,

 Kn(¯β−β∗)T{1KK∑j=1I(j)β|γ−1}−1(¯β−β∗)→χ2p.
###### Remark 6.

In this remark, we compare the asymptotic variance of in Theorem 3 and in Proposition 1. Our proposed estimator is efficient in the sense that its asymptotic variance is equal to the Cramér-Rao lower bound, i.e., , for any . On the other hand, the simple average estimator is not efficient as its asymptotic variance because .

Finally, to construct the confidence interval of

, we need to provide a consistent estimator for the averaged partial information matrix . In the following theorem, we apply the density ratio tilting approach to estimate the variance using data only from the first local site.

###### Theorem 4.

Suppose all the assumptions in Theorem 3 hold. Define

 ~I(j)=−1KnK∑j=1n∑i=1f(yi1,~β(T),¯γ(T)j)f(yi1,~β(T),¯γ(T)1)∇2logf(yi1;~β(T),¯γ(T)j),

and . We have as ,

 Kn(~β(T)−β∗)T(~β(T)−β∗)→χ2p.

## 5 Reduce the influence of the local site

In a practical collaborative research network, each site can act as the local site and obtain an estimate using Algorithm 1. To further reduce the impact of the choice of the local site and improve the stability of the algorithm, we can combine these estimates in Algorithm 1 by a simple average. At the -th site, we define the site-specific surrogate score function to be

 ~Uj(β;¯β,¯Γ)=Uj(β;¯β,¯Γ)+1KK∑k=1{∇βLk(¯β,¯γk)−¯H(k)βγ(¯Hγγ(k))−1∇γLk(¯β,¯γj)}−Uj(¯β;¯β,¯Γ),

where

 Uj(β;¯β,¯Γ)=1KnK∑k=1n∑i=1 f(yij;¯β,¯γk)f(yij;¯β,γj){∇βlogf(yij;β,¯γk)−~H(j,k)βγ{~H(j,k)γγ}−1∇γlogf(yij;β,¯γk)}

and

 ~H(j,k)=−1nn∑i=1∇2logf(yij;¯β,¯γk)f(yij;¯β,¯γk)f(yij;¯β,¯γj).

The surrogate score function is obtained using the individual-level data in the -th site and summary-level data from the other sites. In this case, each site can obtain a surrogate efficient score estimator by solving , and we further combine these estimators by The variance of can be estimated by . The algorithm is summarized below.

## 6 Simulation Study

We consider a logistic regression between a binary outcome

and a binary exposure