What is the relevance of the underlying computational architecture to statistical inference? Classically, the answer has been “not much”—the naive abstraction of a sequential program running on a single machine and providing instantaneous access to arbitrary data points has provided a standard conceptual starting point for statistical computing. In the modern era, however, it is commonplace for data analyses to run on hundreds or thousands of machines, with the data distributed across those machines and no longer available in a single central location. Moreover, the end of Moore’s law has changed computer science—the focus is now on parallel, distributed architectures and, on the algorithmic front, on divide-and-conquer procedures. This has serious implications for statistical inference. Naively dividing datasets into subsets that are processed separately, with a naive merging of the results, can yield inference procedures that are highly biased and highly variable. Naive application of traditional statistical methodology can yield procedures that incur exorbitant communication costs.
Historically, the area in which statisticians have engaged most deeply with practical computing concerns has been in the numerical linear algebra needed to support regression and multivariate statistics, including both sparse and dense matrix algorithms. It is thus noteworthy that over the past decade there has been a revolution in numerical linear algebra in which new “communication-avoiding” algorithms have been developed to replace classical matrix routines Demmel et al. (2012). The new algorithms can run significantly faster than classical algorithms on parallel, distributed architectures.
A statistical literature on parallel and distributed inference has begun to emerge, both in a frequentist setting (Duchi et al., 2012; Zhang et al., 2013; Kannan et al., 2014; Kleiner et al., 2014; Shamir et al., 2014; Mackey et al., 2015; Zhang and Lin, 2015; Lee et al., 2015), and Bayesian setting (Suchard et al., 2010; Cleveland and Hafen, 2014; Maclaurin and Adams, 2014; Wang and Dunson, 2015; Neiswanger et al., 2015; Rabinovich et al., 2016; Scott et al., 2016; Terenin et al., 2016). This literature has focused on data-parallel procedures in which the overall dataset is broken into subsets that are processed independently. To the extent that communication-avoiding procedures have been discussed explicitly, the focus has been on “one-shot” or “embarrassingly parallel” approaches only use one round of communication in which estimators or posterior samples are obtained in parallel on local machines, are communicated to a center node, and then combined to form a global estimator or approximation to the posterior distribution (Zhang et al., 2013; Lee et al., 2015; Wang and Dunson, 2015; Neiswanger et al., 2015). In the frequentist setting, most one-shot approaches rely on averaging (Zhang et al., 2013), where the global estimator is the average of the local estimators. Lee et al. (2015)
extends this idea to high-dimensional sparse linear regression by combining local debiased Lasso estimates(van de Geer et al., 2014). Recent work by Duchi et al. (2015) show that under certain conditions, these averaging estimators can attain the information-theoretic complexity lower bound—for linear regression, at least bits must be communicated in order to attain the minimax rate of parameter estimation, where is the dimension of the parameter and is the number of machines. This holds even in the sparse setting (Braverman et al., 2015).
These averaging-based, one-shot communication approaches suffer from several drawbacks. First, they have generally been limited to point estimation; it is not straightforward to create confidence intervals/regions and hypothesis tests based on the averaging estimator. Second, in order for the averaging estimator to achieve the minimax rate of convergence, each local machine must have access to at least samples, where is the total sample size. In other words, the number of machines should be much smaller than ; a highly restrictive assumption. Third, when the statistical is nonlinear, averaging estimators can perform poorly; for example, our empirical study shows that even for small , of order , the averaging estimator only exhibits a slight improvement over purely local estimators.
In the Bayesian setting, embarrassingly parallel approaches run Markov chain Monte Carlo (MCMC) algorithms in parallel across local machines and transmit the local posterior samples to a central node to produce an overall approximation to the global posterior distribution. Unfortunately, when the dimension
is high, the number of samples obtained locally must be large due to the curse of dimensionality, incurring significant communication costs. Also, when combining local posterior samples in the central node, existing approaches that approximate the global posterior distribution by a weighted empirical distribution of “averaging draws”(Wang and Dunson, 2015; Neiswanger et al., 2015) tend to suffer from the weight-degeneracy issue (weights collapse to only a few samples) when is large.
In this paper, we formulate a unified framework for distributed statistical inference. We refer to our framework as the Communication-efficient Surrogate Likelihood
(CSL) framework. From the frequentist perspective, CSL provides a communication-efficient surrogate to the global likelihood function that can play the role of the global likelihood function in forming the maximum likelihood estimator (MLE) in regular parametric models or the penalized MLE in high-dimensional models. From a Bayesian perspective, CSL can be used to form a quasi-posterior distribution(Chernozhukov and Hong, 2003) as a surrogate for the full posterior. The CSL approximation can be constructed efficiently by communicating bits. After its construction, CSL can be efficiently evaluated by using the samples in a single local machine. Even in a non-distributed Bayesian setting, CSL can be used as a computationally-efficient surrogate to the likelihood function by pre-dividing the dataset into subsamples—the computational complexity of one iteration of MCMC is then reduced by a factor of .
Our CSL-based distributed inference approach overcomes the aforementioned drawbacks associated with the one-shot and embarrassingly parallel approaches. In the frequentist framework, a CSL-based estimator can achieve the same rate of convergence as the global likelihood-based estimator while incurring a communication complexity of only . Moreover, the CSL framework can readily be applied iteratively, with the resulting multi-round algorithm achieving a geometric convergence rate with contraction factor , where is the number of samples in each local machine. This rate of convergence significantly improves on analyses based on condition-number contraction factors used to analyze methods that form the global gradient in each iteration by combining local gradients. As an implication, in order to achieve the same accuracy as the global likelihood-based estimator, can be independent of the total sample size as long as iterations are applied, which is constant for . In contrast, the averaging estimator requires . Thus, due to the fast rate, usually two-three iterations suffice for our procedure to match the same accuracy of the global likelihood-based estimator even for relatively large (See Section 4.1 for more details). Unlike bootstrap-based approaches (Zhang et al., 2013) for boosting accuracy, the additional complexity of the iterative version of our approach grows only linearly in the number of iterations. Finally, our empirical study suggests that a CSL-based estimator may exhibit significant improvement over the averaging estimator for nonlinear distributed statistical inference problems.
For high-dimensional -regularized estimation, the CSL framework yields an algorithm that communicates bits, attaining the optimal communication/risk trade off (Garg et al., 2014). This improves over the averaging method of Lee et al. (2015) because it requires times less computation, and allows for iterative refinement to obtain arbitrarily low optimization error in a logarithmic number of rounds111We note that during the preparation of this manuscript, we became aware of concurrent work of Wang et al. (2016) who also study a communication-efficient surrogate likelihood. Their focus is solely on the high-dimensional linear model setting..
In the Bayesian framework, our method does not require transmitting local posterior samples and is thus free from the weight degeneracy issue. This makes the communication complexity of our approach considerably lower than competing embarrassingly parallel Bayesian computation approaches.
There is also relevant work in the distributed optimization literature, notably the distributed approximate Newton algorithm (DANE) proposed in Shamir et al. (2014). Here the idea is to combine gradient descent with a local Newton method; as we will see, a similar algorithmic structure emerges from the CSL framework. DANE has been rigorously analyzed only for quadratic objectives, and indeed the analysis in Shamir et al. (2014) does not imply that DANE can improve over the gradient method for non-quadratic objectives. In contrast, our analysis demonstrates fast convergence rates for a broad class of regular parametric models and high-dimensional models and is not restricted to quadratic objectives. Another related line of work is the iterated Hessian sketch (IHS) algorithm Pilanci and Wainwright (2014) for constrained least-squares optimization. DANE applied to quadratic problems can be viewed as a special case of IHS by choosing the sketching to be a rescaled subsampling matrix; however, the analysis in Pilanci and Wainwright (2014) only applies to a class of low-incoherence sketching matrices that excludes subsampling.
The remainder of this paper is organized as follows. In Section 2, we informally present the motivation for CSL. Section 3 presents algorithms and theory for three different problems: parameter estimation in low-dimensional regular parametric models (Section 3.1), regularized parameter estimation in the high-dimensional problems (Section 3.2), and Bayesian inference in regular parametric models (Section 3.3). Section 4 presents experimental results in these three settings. All proofs are provided in the Appendix.
2 Background and problem formulation
We begin by setting up our general framework for distributed statistical inference. We then turn to a description of the CSL methodology, demonstrating its application to both frequentist and Bayesian inference.
2.1 Statistical models with distributed data
Let denote identically distributed observations with marginal distribution , where is a family of statistical models parametrized by , is the parameter space and is the true data-generating parameter. Suppose that the data points are stored in a distributed manner in which each machine stores a subsample of observations. Let denote the subsample that is stored in the th machine for . Our goal is to conduct statistical inference on the parameter while taking into consideration the communication cost among the machines. For example, we may want to find a point estimator and an associated confidence interval (region).
be a twice-differentiable loss function such that the true parameter is a minimizer of the population risk; that is,
Define the local and global loss functions as
Here is the loss function evaluated at by using the local data stored in machine . The negative log-likelihood function is a standard example of the loss function .
2.2 Distributed statistical inference
In this subsection, we motivate the CSL methodology by constructing a surrogate loss that approximates the global loss function in a communication-efficient manner. We show that it can be constructed in any local machine by communicating at most
-dim vectors. After the construction,can be used to replace the global loss function in various statistical inference procedures by only using the data in a local machine (see Sections 3.1-3.3
). We aim to show that this distributed inference framework can simultaneously achieve high statistical accuracy and low communication cost. In this section we motivate our construction using heuristic arguments; a rigorous analysis is provided in Section3 to follow.
Our motivation starts from the Taylor series expansion of . Viewing as an analytic function, we expand it into an infinite series:
Here is any initial estimator of , for example, the local empirical loss minimizer in the first machine . Because the data is split across machines, evaluating the derivatives requires one communication round. However, unlike the -dim gradient vector , the higher-order derivatives require communicating more than bits from each machine. This reasoning motivates us to replace the global higher-order derivatives with local derivatives, leading to the following approximation of :
where the fact that is a consequence of matrix concentration.
We now use a Taylor expansion of around to replace the infinite sum of high-order derivatives in expression (5) with . This yields:
Finally, we omit the additive constants in (7) and redefine as follows:
Henceforth we refer to this expression for as a surrogate loss function. In the remainder of the section we present three examples of using this surrogate loss function for frequentist and Bayesian inference. A rigorous justification for , which effectively provides conditions under which the terms in (6) are small, follows in Section 3.
In the low-dimensional regime where the dimensionality of parameter space is , the global empirical loss minimizer,
achieves a root- rate of convergence under mild conditions. One may construct confidence regions associated with using the sandwiched covariance matrix (see, e.g., (11)). In our distributed inference framework, we aim to capture some of the desirable properties of by replacing the global loss function with the surrogate loss function and defining the following communication-efficient estimator:
Indeed, in Section 3.1, we show that is equivalent to up to higher-order terms, and we provide two ways to construct confidence regions for using local observations stored in machine .
In anticipation of that theoretical result, we give a heuristic argument for why is a good estimator. For convenience, we assume that the empirical risk function has a unique minimizer. First consider the global empirical loss minimizer . Under our assumption that the loss function is twice-differentiable, is the unique solution of equation222There is no Taylor’s theorem for vector-valued functions, but we formalize this heuristic in Section 3.1.
By solving this equation, we obtain , as long as the Hessian matrix is non-singular. Now let is turn to the surrogate loss minimizer . An analogous argument leads to and we only need to show that is of order . In fact, by our construction,
which is of order as long as where is the number of machines. For example, this requirement on initial estimator is satisfied by the minimizer of the subsample loss function when .
Example (High-dimensional regularized estimator):
In the high-dimensional regime, where the dimensionality can be much larger than the sample size , we need to impose a low-dimensional structural assumption, such as sparsity, on the unknown true parameter . Under such an assumption, regularized estimators are known to be effective for estimating . For concreteness, we focus on the sparsity assumption that most components of the -dim vector is zero, and consider the -regularized estimator,
, as the benchmark estimator that we want to approximate, where is the regularization parameter. In the distributed inference framework, we consider the following estimator obtained from the surrogate loss function :
In Section 3.2, we show that achieves the same rate of convergence as the benchmark estimator under a set of mild conditions. This idea of using the surrogate loss function to approximate the global regularized loss function is general and is applicable to other high-dimensional problems.
Example (Bayesian inference):
In the Bayesian framework, viewing parameter as random, we place a prior distribution over parameter space . For convenience, we also use the notation to denote the pdf of the prior distribution at point
. According to Bayes’ theorem, the posterior distribution satisfies
The loss function corresponds to the negative log-likelihood function for the statistical model and is the global negative log-likelihood associated with the observations . The posterior distribution can be used to conduct statistical inference. For example, we may construct an estimator of as the posterior expectation and use the highest posterior region as a credible region for this estimator. Since the additive constant in expression (6) can be absorbed into the normalizing constant, we may use the surrogate posterior distribution,
, to approximate the global posterior distribution . In Section 3.3, we formalize this argument and show that this surrogate posterior gives a good approximation the global posterior.
From now on, we will refer the methodology of using the surrogate loss function to approximate the global loss function for distributed statistical inference as a Communication-efficient Surrogate Likelihood (CSL) method. Although our focus is on distributed inference, we also wish to note that the idea of computing the global likelihood function using subsamples may be useful not only in the distributed inference framework, but also in a single-machine setting in which the sample size is so large that the evaluation of the likelihood function or its gradient is unduly expensive. Using our surrogate loss function , we only need one pass over the entire dataset to construct . After its construction, can be efficiently evaluated by using a small subset of the data.
3 Main results and their consequences
In this section, we delve into the three examples in Section 2.2 of applying the CSL method. For each of the examples, we provide an explicit bound on either the estimation error of the resulting estimator or the approximation error of the approximated posterior .
3.1 Communication-efficient -estimators in low dimensions
In this subsection, we consider a low-dimensional parametric family , where the dimensionality of is much smaller than the sample size . Under this setting, the minimizer of the population risk in optimization problem (1) is unique under the set of regularity conditions to follow and is identifiable. As a concrete example, we may consider the negative log-likelihood function as the loss function, where is the pdf for . Note that the developments in this subsection can also be extended to misspecified families where the marginal distribution of the observations is not contained in the model space . Under misspecification, we can view the parameter associated with the projection of the true data generating model onto the misspecified model space, , as the “true parameter.” The results under misspecification are similar to the well-specified case and are omitted in this paper.
For low-dimensional parametric models, we impose some regularity conditions on the parameter space, the loss function and the associated population risk function . These conditions are standard in classical statistical analysis of -estimators. In the rest of the paper, we call a parametric model that satisfies this set of regularity conditions a regular parametric model. Our first assumption describes the relationship of the parameter space and the true parameter .
Assumption PA (Parameter space):
The parameter space is a compact and convex subset of . Moreover, and .
The second assumption is a local identifiability condition, ensuring that is a local minimum of .
Assumption PB (Local convexity):
The Hessian matrix of the population risk function is invertible at : there exist two positive constants , such that .
When the loss function is the negative log-likelihood function, the corresponding Hessian matrix is an information matrix.
Our next assumption is a global identifiability condition, which is a standard condition for proving estimation consistency.
Assumption PC (Identifiability):
For any , there exists , such that
Assumption PD (Smoothness):
There exist constants and a function such that
Moreover, the function satisfies for some constant .
Based on the heuristic argument in Section 2.2, we propose to use the surrogate function defined in (8) as the objective function for constructing an M-estimator in regular parametric models. Our first result shows that under Assumptions PA-PD, given any reasonably good initial estimator , any minimizer of , i.e.,
significantly boosts the accuracy in terms of the approximation error to the global empirical risk minimizer .
Suppose that Assumptions PA-PD hold and the initial estimator lies in the neighborhood of . Then any minimizer of the surrogate loss function satisfies
with probability at least , where the constants and are independent of .
as long as , which is true for , the empirical risk minimizer in local machine . To formalize this argument, we have the following corollary that provides an risk bound for .
Under the conditions of Theorem 3.1, we have
where and is some constant independent of .
Note that the Hájek-Le Cam minimax theorem guarantees that for any estimator based on samples, we have
Therefore, the estimator is (first-order) minimax-optimal and achieves the Cramér-Rao lower bound when the loss function is the negative log-likelihood function.
The computational complexity of exactly minimizing the surrogate loss in (9) can be further reduced by using a local quadratic approximation to . In fact, we have by Taylor’s theorem that
As before, we replace the global gradient with the local gradient , which leads to the following quadratic surrogate loss:
Because the surrogate loss functions and agree up to the second-order Taylor expansion, they behave similarly when used as objective functions for constructing -estimators. This motivates the closed-form estimator
The next theorem shows that satisfies the same estimation bound as . Unlike the classical one-step MLE that requires the initial estimator to be within an neighborhood of the truth , we only require to be .
Suppose that Assumptions PA-PD hold and the initial estimator satisfies . Then the local one-step estimator satisfies
with probability at least , where and are independent of .
The analogue of Corollary 3.2 can also be stated for .
Iterative local estimation algorithm:
Theorem 3.1 (Theorem 3.3) suggests that an iterative algorithm may reduce the approximation error by a factor of in each iteration as long as the initial estimator satisfies , or equivalently, . We refer to such an algorithm as an Iterative Local Estimation Algorithm (ILEA, see Algorithm 1). In each iteration of ILEA, we set as the current iterate , construct the surrogate loss function , and then solve for the next iterate by either exactly minimizing the surrogate loss:
or by forming a local one-step quadratic approximation:
where is positive constant independent of . If the desired accuracy is the statistical accuracy of the MLE and our initial estimator is -consistent, then we need to conduct at most
iterations. ILEA interpolates between the gradient method and Newton’s algorithm. Whenis large relative to , then ILEA behaves like Newton’s algorithm, and we achieve the optimal statistical accuracy in one iteration. If is a fixed constant size, then ILEA reduces to a preconditioned gradient method. By appropriately choosing the subsample size , ILEA achieves a trade-off among storage, communication, and computational complexities, depending on specific constraints of computing resources.
Confidence region construction:
We now consider a natural class of local statistical inference procedures based on the surrogate function that only uses the subsample in Machine . It is a classical result that under Assumptions PA-PD, the global empirical risk minimizer satisfies (see the proof of Corollary 3.4 in Section A.4)
where is the so-called sandwich covariance matrix. For example, when corresponds to the negative log-likelihood function, will be the inverse of the information matrix. It is easy to see that the plug-in estimator,
is a consistent estimator of the asymptotic covariance matrix ; that is, in probability as . Based on the limiting distribution of and the plug-in estimator , we can conduct statistical inference, for example, constructing confidence intervals for .
The following corollary shows that for any reasonably good initial estimator , the asymptotic distribution of either the minimizer of the surrogate function or the local one-step quadratic approximated estimator matches that of the global empirical risk minimizer . Moreover, we also have a consistent estimator of using only the local information in Machine . Therefore, we can conduct statistical inference locally without access to the entire data while achieving the same asymptotic inferential accuracy as global statistical inference procedures.
Under the same set of assumptions in Theorem 3.1, if the initial estimator satisfies , then the surrogate minimizer satisfies
and if , then
Moreover, the following plug-in estimator:
is a consistent estimator for as . If we also have , then the plug-in estimator
is also a consistent estimator for as . Similar results hold for the local one-step quadratic approximated estimator under the assumptions of Theorem 3.3.
Corollary 3.4 illustrates that we may substitute as the global loss function and use it for statistical inference— is precisely the plug-in estimator of the sandwiched covariance matrix using the surrogate loss function (cf. equation (12)). In the special case when is the negative log-likelihood function, we may instead use as our plug-in estimator for . tends to be a better estimator than when
, since the varianceof the middle term in equation (14) is much smaller than variance of the middle term in equation (13). See Section 4.1 for an empirical comparison of using and for constructing confidence intervals.
3.2 Communication-efficient regularized estimators with -regularizer
In this subsection, we consider high-dimensional estimation problems where the dimensionality of parameter can be much larger than the sample size . Although the development here applies to a broader class of problems, we focus on -regularized procedures. -regularized estimators work well under the sparsity assumption that most components of the true parameter is zero. Let be a subset of that encodes the sparsity pattern of and let . Using the surrogate loss function as a proxy to the global likelihood function in -regularized estimation procedures, we obtain the following communication-efficient regularized estimator:
We study the statistical precision of this estimator in the high-dimensional regime.
We first present a theorem on the statistical error bound of the estimator for general loss function . We then illustrate the use of the theorem in the settings of high-dimensional linear models and generalized linear models. We begin by stating our assumptions.
Assumption HA (Restricted strongly convexity):
The local loss function at machine is restricted strongly convex over : for all ,
where is some positive constant independent of .
As the name suggested, restricted strongly convexity requires the global loss function to be a strongly convex function when restricted to the cone .
Assumption HB (Restricted Lipschitz Hessian):
Both the local and global loss function and have restricted Lipschitz Hessian at radius : for all ,
where is some positive constant independent of .
The restricted Lipschitz Hessian condition is always satisfied for linear models where the Hessian is a constant function of .
Suppose that Assumption HA and Assumption HB at radius are true. If regularization parameter satisfies , then
The lower bound condition on the regularization parameter