# Distributed linear regression by averaging

Modern massive datasets pose an enormous computational burden on practitioners. Distributed computation has emerged as a universal approach to ease the burden: Datasets are partitioned over machines, which compute locally and communicate short messages. Distributed data also arises due to privacy reasons, such as in medicine. It is important to study how to do statistical inference and machine learning in a distributed setting. In this paper, we study one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, and take a weighted average of the parameters. How much do we lose compared to doing linear regression on the full data? Here we study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Our results are simple to use in practice. Second, different problems are affected differently by the distributed framework. Estimation error and confidence interval length increase substantially, while prediction error increases much less. These results match simulations and a data analysis example. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.


## 1 Introduction

Datasets are constantly increasing in size and complexity. This leads to important challenges for practitioners. Statistical inference and machine learning, which used to be computationally convenient on small datasets, now bring an enormous computational burden.

Distributed computation is a universal approach to deal with large datasets. Datasets are partitioned across several machines (or workers). The machines perform computations locally and communicate only small bits of information with each other. They coordinate to compute the desired quantity. This is the standard approach taken at large technology companies, which routinely deal with huge datasets spread over computer clusters. What are the best ways to divide up and coordinate the work?

The same problem arises when the data is distributed due to privacy, security, or ethical concerns. For instance, medical and healthcare data is typically distributed across hospitals or medical units. The parties agree that they want to aggregate the results. At the same time, they are concerned with privacy, and do not want other parties to look at their data. How can they compute the desired aggregates, without sharing the entire data?

In both cases, the key question is how to do statistical estimation and machine learning in a distributed setting. And what performance can the best methods achieve? This is an important question of broad interest, and it is expected that the area of distributed estimation and computation will grow even more in the future.

In this paper, we develop precise theoretical answers to fundamental questions in distributed estimation. We study one-step parameter averaging in statistical linear models under data parallelism. Specifically, suppose we do linear regression (ordinary least squares, OLS) on each subset of a dataset distributed over k machines, and take an optimal weighted average of the regression coefficients. How do the statistical and predictive properties of this estimator compare to doing OLS on the full data?

We study the behavior of several learning and inference problems, such as estimation error, test error (out-of-sample prediction error), and confidence intervals. We also consider a high-dimensional (or proportional-limit) setting where the number of parameters is of the same order as the number of total samples (i.e., the size of the training data). We discover the following key phenomena, some of which are surprising in the context of existing work:

1. Sub-optimality. One-step averaging is not optimal, meaning that it leads to a performance decay. In contrast to some recent work (see the related work section), we find that there is a clear performance loss due to one-step averaging. This loss is due to the essential high-dimensional nature of our problem. However, we can quantify this loss precisely. This paves the way to developing improved methods for distributed estimation.

2. Strong problem-dependence. Different learning and inference problems are affected differently by the distributed framework. Specifically, estimation error and the length of confidence intervals increase substantially, while prediction error increases less. This phenomenon was apparently not noticed before.

3. Simple form and universality. We discover that the asymptotic efficiencies have simple forms that are often universal. Specifically, they do not depend on the covariance matrix of the data, or on the sample sizes on the local machines. For instance, the estimation efficiency decreases linearly in the number of machines (see Figure 1 and Table 1).

While there is already a lot of work in this direction (see Section 4), our results are new and complementary. The key elements of novelty are: (1) The sample size and the dimension are comparable, and we do not assume sparsity. Hence the problems we study are essentially high-dimensional. There is very little work in this direction. (2) We have a new mathematical approach, using recent results from asymptotic random matrix theory. Our approach also develops a novel theoretical tool, the calculus of deterministic equivalents, which may be useful in other problems as well.

### 1.1 Summary of our results

Our contributions, and the structure of our paper, are as follows (see Section 4 for some related work):

We start with studying estimation error in linear models. We find an explicit expression for the finite-sample relative efficiency of one-step averaging compared to OLS. We show directly that it is less than or equal to unity, by showing that the function f(X) = 1/tr(X^{-1}) is concave on positive definite matrices.

We then consider asymptotics, first under Marchenko-Pastur models where the data is iid from a distribution with a general covariance matrix. We find a simple expression for the limit of the relative efficiency, called the asymptotic relative efficiency (ARE). We give a multi-response regression characterization that gives the correct "degrees of freedom" for distributed regression.

Next, we study more general elliptical models, where the different samples have different scales. We find an expression for the ARE, albeit less explicit than the previous one. We show that the ARE is monotone and convex. We also perform a detailed worst-case analysis, giving examples of elliptical scale distributions for which the split across two machines leads to a huge increase in estimation error.

We then develop a more general framework for evaluating the relative efficiency of predicting linear functionals of the regression coefficients. We show that test error (out-of-sample prediction), training error (in-sample prediction), confidence intervals, and regression function estimation all fall into this general framework. We find the optimal relative efficiency, and show that it depends on the traces of certain functionals of the Gram matrices. We also generalize the concavity property obtained above.

To evaluate the needed trace functionals, we develop a calculus of deterministic equivalents in random matrix theory. Such deterministic equivalents have appeared in prior work, but we develop a more general approach. Specifically, we define two sequences of matrices to be asymptotically equivalent, if they have the same limits of inner products with any other fixed sequences of matrices. In terms of this definition, we present a general Marchenko-Pastur theorem in elliptical models, which is a streamlined version of previously obtained results. We also give several rules of the calculus, including rules for sums, products, traces and Stieltjes transforms.

As an application of the calculus of deterministic equivalents, we find the limits of the relative efficiencies for the four functionals introduced above, test and training error, confidence intervals, and regression function estimation, in a distributed setting. We show that the efficiency loss depends strongly on the learning problem. See Figure 1 and Table 1. For instance, estimation error and CI length is much more affected than test error.

We show that our theoretical results are very accurate in numerical simulations throughout the paper, and also in Section 7. We also illustrate that our results are accurate in an empirical data example using the NYC Flights dataset.

The code for our paper is available at github.com/dobriban/dist.

## 2 Estimation

### 2.1 Finite sample relative efficiency

We start by studying estimation error in linear models. Consider the standard linear model Y = Xβ + ε. Here we have an outcome variable along with some covariates, and want to understand their relationship. We observe n such data points, arranging their outcomes into the n-vector Y, and their covariates into the n × p matrix X. We assume that Y depends linearly on X, via some unknown p-dimensional parameter vector β.

We consider the case where there are more training data points than parameters, i.e., n > p, while p can also be large. In that case, a gold standard is the usual least squares estimator (ordinary least squares, or OLS) β̂ = (X^⊤X)^{-1}X^⊤Y. We also assume that the coordinates of the noise ε are uncorrelated and have variance σ². Then it is well known that the mean squared error (MSE) of OLS equals E‖β̂ − β‖² = σ² · tr[(X^⊤X)^{-1}].

Suppose now that the n samples are distributed across k machines (these can be real machines, but they can also be, say, sites or hospitals in medical applications). The i-th machine has the n_i × p matrix X_i, containing n_i samples, and also the n_i-vector Y_i of the corresponding outcomes for those samples. Thus, the i-th worker has access to only a subset of n_i training data points out of the total of n training data points. For instance, if the data points denote users, then they may be partitioned into sets based on country of residence, and we may have samples from the United States on one server, samples from Canada on another server, etc. The broad question is: How can we estimate the unknown regression parameter β if we need to do most of the computations locally?

Let us write the partitioned data as

 X = [X_1^⊤, …, X_k^⊤]^⊤,  Y = [Y_1^⊤, …, Y_k^⊤]^⊤.

We also assume that each local OLS estimator is well defined, which requires that the number of local training data points is at least p on each machine, i.e., n_i ≥ p. We consider combining the local OLS estimators β̂_i = (X_i^⊤X_i)^{-1}X_i^⊤Y_i at a parameter server via one-step weighted averaging. Since they are uncorrelated, unbiased for β, and have MSE σ² · tr[(X_i^⊤X_i)^{-1}], we can find the optimal unbiased weighted estimator β̂_dist = Σ_i w_i β̂_i with Σ_i w_i = 1, and its mean squared error. As a consequence, we can find an explicit expression for the finite-sample relative efficiency for estimation.

[Relative efficiency in OLS] Consider the distributed linear regression problem described above. The optimal unbiased weighted estimator has weights proportional to 1/tr[(X_i^⊤X_i)^{-1}]. Its mean squared error equals

 M(β̂_dist) = σ² · [Σ_{i=1}^k 1/tr[(X_i^⊤X_i)^{-1}]]^{-1}.

Therefore, the relative efficiency of the distributed estimator with respect to the full estimator equals

 RE(X_1, …, X_k) = E‖β̂ − β‖² / E‖β̂_dist − β‖² = tr[(X^⊤X)^{-1}] · Σ_{i=1}^k 1/tr[(X_i^⊤X_i)^{-1}].

Recall that the MSE of the full OLS estimator is E‖β̂ − β‖² = σ² · tr[(X^⊤X)^{-1}].

See Section 9.1 for the proof.

The relative efficiency is a fundamental quantity, because it quantifies the loss of efficiency from distributed estimation. It is of great interest to understand the behavior of this quantity. For instance, for what matrices does it equal nearly one? In a simple special case, when each local problem is orthonormal, in the sense that X_i^⊤X_i = I_p, we see that the relative efficiency is unity, so that there is no loss of efficiency in distributed OLS. More generally, the same holds when the Gram matrices X_i^⊤X_i are proportional to each other.

To achieve a deeper understanding of the relative efficiency, we will first prove directly that it is at most unity. This turns out to require the concavity of the matrix functional f(X) = 1/tr(X^{-1}). See Section 9.2 for the proof.

[Concavity for relative efficiency] The function

 f(X) = 1/tr(X^{-1})

is concave on positive definite matrices. As a consequence, the relative efficiency for distributed estimation is at most unity for any matrices X_1, …, X_k:

 RE(X_1, …, X_k) ≤ 1.

The concavity claim seems to be distantly related to the so-called matrix φ-entropies (Chen and Tropp, 2014); however, there does not seem to be a direct connection.

If not all sample sizes n_i are larger than p, and the OLS estimator is not well-defined on each machine, then we can still take the average of the estimators that are well-defined. It is easy to see that this corresponds to reducing the sample size from n to the effective sample size n* = Σ_{i∈S} n_i, where S is the set of machines on which OLS is well-defined. Therefore, our results can still be used, with n replaced by n*.

However, the form of the finite-sample relative efficiency does not show clearly how the performance of averaging depends on the problem parameters, such as the number of machines. Therefore, we will consider an asymptotic setting below.

### 2.2 Asymptotics

To get more insightful results about the efficiency, we will take an asymptotic approach. In general, we notice that the RE only depends on the eigenvalues of the Gram matrices X_i^⊤X_i and X^⊤X, and therefore it makes sense to study models where the eigenvalues of those matrices are precisely characterized. Indeed, we will adopt such models from asymptotic random matrix theory.

Specifically, recall that the empirical spectral distribution (e.s.d.) of a p × p symmetric matrix M is simply the CDF of its eigenvalues (which are all real-valued). More formally, it is the discrete distribution that places mass 1/p on each of the p eigenvalues of M. There are many models in random matrix theory where the dimensions of the matrices grow, while with probability one, the e.s.d. converges weakly to some limiting spectral distribution (l.s.d.) F. In that case, it follows that for suitable test functions f of the eigenvalues, the trace functional

 (1/p) tr f(M) = (1/p) Σ_{i=1}^p f(λ_i(M))

has a well-defined limit in terms of F, namely ∫ f(λ) dF(λ). This explains how we can use results from random matrix theory to analyze the relative efficiency.

We will consider models such that the normalized Gram matrices n_i^{-1}X_i^⊤X_i have almost sure limiting spectral distributions F_{γ_i}. Then, we should have the limits

 tr[(X_i^⊤X_i)^{-1}] → γ_i · E_{F_{γ_i}} T^{-1},

assuming that the limits are finite. We will make assumptions to ensure this holds. Similarly, assuming that n^{-1}X^⊤X has almost sure limiting spectral distribution F_γ, and that the limit exists, we should have

 tr[(X^⊤X)^{-1}] → γ · E_{F_γ} T^{-1}.

Hence, the RE should converge to an asymptotic relative efficiency (ARE) of the form

 RE(X_1, …, X_k) → γ · E_{F_γ} T^{-1} · Σ_{i=1}^k 1/(γ_i · E_{F_{γ_i}} T^{-1}).

Next, we will consider some specific models for X, under which the ARE has a more explicit form. First, we will consider "Marchenko-Pastur" type sample covariance matrices, which are fundamental in multivariate statistics. See e.g., Bai and Silverstein (2009); Anderson (2003); Paul and Aue (2014); Yao et al. (2015) for reference.

In this basic model, the rows of X are iid p-dimensional observations x_1, …, x_n. The samples are drawn from a population with covariance matrix Σ. The classical model is that the data points have the form x_i = Σ^{1/2} z_i, for some vector z_i with iid entries. Arranging the data points as the rows of the data matrix X, this has the form

 X = Z Σ^{1/2},

where the n × p matrix Z has iid standardized entries, and Σ is a deterministic positive semi-definite matrix. Let H be the empirical spectral distribution of Σ.

In this model, the Marchenko-Pastur distribution describes the weak limit of the spectral distribution of the sample covariance matrix n^{-1}X^⊤X. Suppose the entries of Z come from an infinite array of iid variables with mean zero and variance 1, and we take a sequence of such problems with both the dimension and the sample size growing, n, p → ∞, with asymptotically fixed aspect ratio p/n → γ > 0. If the e.s.d. of Σ converges to some limit distribution, i.e., H_p ⇒ H weakly, then with probability 1, the e.s.d. of n^{-1}X^⊤X also converges, so that it tends to a probability measure F_γ (Marchenko and Pastur, 1967; Bai and Silverstein, 2009). We assume moreover that H is compactly supported away from the origin, in which case the same is true for F_γ.

In this model, we obtain the following surprisingly simple expression for the ARE.

[ARE for Marchenko-Pastur models] Consider the above high-dimensional asymptotic limit, where the n × p data matrix X is random, and its rows are iid from a population with some covariance matrix Σ. Specifically, the data has the form X = Z Σ^{1/2}, where Z is n × p with iid standardized entries, and n, p → ∞ such that p/n → γ ∈ (0, 1). Suppose the data is distributed over k machines with sample sizes n_i, and the sample sizes are all proportional to the dimension, with p/n_i → γ_i > 0. Then, the ARE of the distributed one-step averaging OLS estimator with respect to the full OLS estimator equals

 ARE = (1 − kγ)/(1 − γ).

Moreover, for any finite sample size n, dimension p, and number of machines k, we can approximate the ARE as

 ARE ≈ (n − kp)/(n − p).

See Section 9.3 for the proof.

We find this to be a surprisingly simple formula, which can moreover be easily computed in practice. Moreover, the formula has several more interesting properties:

The ARE decreases linearly with the number of machines k. This holds as long as kγ < 1. At the threshold case kγ = 1, there is a phase transition. The reason is that there is a singularity, and the OLS estimator is not well defined anymore for at least one machine.

However, we should be somewhat cautious about interpreting the linear decrease. In some cases, it may make more sense to study the root mean squared error (RMSE). That quantity has a different loss of efficiency, namely the square root of the ARE presented above.

The ARE has two important universality properties.

First, it does not depend on how the samples are distributed across the different machines, i.e., it is independent of the specific sample sizes n_i.

Second, it does not depend on the covariance matrix Σ of the samples. This is in contrast to the estimation error of OLS, which does in fact depend on the covariance structure. Therefore, we think that the cancellation of Σ in the ARE is noteworthy.

The ARE is also very accurate in simulations. See Figure 2 for an example. Here we report the results of a simulation where we generate an n × p random matrix X such that the rows are distributed independently as N(0, Σ). We take Σ to be diagonal with entries chosen uniformly at random between 1 and 2. For each value of k such that kp ≤ n, we split the data into k groups of random sizes n_i. To ensure that each group has a size n_i ≥ p, we first let n_i = p, and then distribute the remaining samples uniformly at random. We then show the results of the expression for the RE from Lemma 2.1 compared to the theoretical ARE. We observe that the two agree closely.
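The simulation just described can be sketched as follows (a hedged illustration: the sizes are arbitrary, and we use equal splits rather than the randomized split sizes, which does not change the limit):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1200, 100
# Diagonal covariance with entries uniform on [1, 2], as in the text
sd = np.sqrt(rng.uniform(1, 2, size=p))
X = rng.standard_normal((n, p)) * sd  # rows ~ N(0, Sigma), Sigma diagonal

results = {}
for k in (2, 4, 8):
    parts = np.array_split(np.arange(n), k)  # equal split; each n_i >= p
    tr_full = np.trace(np.linalg.inv(X.T @ X))
    re = tr_full * sum(1 / np.trace(np.linalg.inv(X[idx].T @ X[idx])) for idx in parts)
    are = (n - k * p) / (n - p)  # theoretical approximation
    results[k] = (re, are)
print(results)
```

The finite-sample RE tracks (n − kp)/(n − p) closely and decreases as k grows.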

It is also of interest to understand the performance of one-step averaging if we use suboptimal weights. How much do we lose compared to the optimal performance if we do not use the right weights? In practice, it may seem reasonable to take a simple average of all estimators. We have performed that analysis in Section 9.4, and we found that the loss can be viewed in terms of an inequality between the arithmetic and harmonic means.

There are several more remarkable properties to note. We will discuss them in turn. We have studied the monotonicity properties and interpretation of the relative efficiency, see Section 9.5. Our results also show that the distributed regression estimator is minimax rate-optimal as long as the number of machines is not too large (Section 9.6).

Next we give a multi-response regression characterization that heuristically gives the correct "degrees of freedom" for distributed regression. This will be helpful to understand the asymptotic formulas derived above.

We re-parametrize the problem, treating the samples on each machine as a different outcome. We write the multi-response outcome matrix, the feature matrix, and the corresponding noise as block-diagonal matrices:

 \underline{Y} = diag(Y_1, …, Y_k),  \underline{X} = diag(X_1, …, X_k),  \underline{ε} = diag(ε_1, …, ε_k),

where diag(·) denotes the block-diagonal matrix with the given blocks on the diagonal and zeros elsewhere.

We also introduce \underline{β}, the parameter matrix, which shares parameters across the outcomes:

 \underline{β} = diag(β, …, β) = I_k ⊗ β.

Note that Y = Xβ + ε is equivalent to \underline{Y} = \underline{X} \underline{β} + \underline{ε}. The OLS estimator of \underline{β} is \underline{β̂} = (\underline{X}^⊤\underline{X})^{-1}\underline{X}^⊤\underline{Y}. This can be calculated as

 \underline{β̂} = diag(β̂_1, …, β̂_k) = diag((X_1^⊤X_1)^{-1}X_1^⊤Y_1, …, (X_k^⊤X_k)^{-1}X_k^⊤Y_k).

Notice that the estimators of the coefficients of the different outcomes are the familiar distributed OLS estimators β̂_i. Now, we can find a plug-in estimator of β, based on \underline{β̂}. Given the form of \underline{β} above, for any weight vector w such that Σ_i w_i = 1, the parameter β can be expressed in terms of the tensorized parameter \underline{β} as a weighted combination

 β = (1_k^⊤ ⊗ I_p) \underline{β} w.

Therefore, for any unbiased estimator \underline{β̂} of \underline{β}, the corresponding weighted combination estimators given below are unbiased for β:

 β̂(w) = (1_k^⊤ ⊗ I_p) \underline{β̂} w.

In our case, given the zeros in the estimator, this simply reduces to the weighted sum β̂(w) = Σ_i w_i β̂_i.

This explains how our problem can be understood in the framework of multi-response regression. Also, the number of parameters in that problem is kp, so the "degrees of freedom" is n − kp. Indeed, the residual effective degrees of freedom is usually defined as n minus the trace of the hat matrix. Let H_i = X_i(X_i^⊤X_i)^{-1}X_i^⊤ be the hat matrix on the i-th machine, so that Ŷ_i = H_i Y_i. Then it is easy to see that tr(H_i) = p, for all i. Since the hat matrix \underline{H} of the multi-response problem is simply the block diagonal matrix with the H_i as blocks, we see that tr(\underline{H}) = kp, as required.

This provides a simple explanation for why the residual "degrees of freedom" of a distributed estimation problem is n − kp, and also for why the relative efficiency is approximately (n − kp)/(n − p).
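The degrees-of-freedom bookkeeping can be checked directly: each local hat matrix is a rank-p projection, so the block-diagonal hat matrix of the multi-response problem has trace kp. A minimal sketch with arbitrary simulated data:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 300, 20, 3
X = rng.standard_normal((n, p))
parts = np.array_split(np.arange(n), k)

# Local hat matrices H_i = X_i (X_i^T X_i)^{-1} X_i^T
hats = [X[idx] @ np.linalg.inv(X[idx].T @ X[idx]) @ X[idx].T for idx in parts]
local_traces = [np.trace(H) for H in hats]  # each equals p (projection onto a p-dim space)
total_trace = sum(local_traces)             # trace of the block-diagonal hat matrix: k*p
print(local_traces, total_trace)
```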

### 2.3 Elliptical models

Second, we study the more general setting of elliptical data. In this model the data samples may have different scalings, having the form x_i = g_i^{1/2} Σ^{1/2} z_i, for some vector z_i with iid entries, and for datapoint-specific scale parameters g_i > 0. Arranging the data as the rows of the matrix X, this takes the form

 X = Γ^{1/2} Z Σ^{1/2},

where Z and Σ are as before: Z has iid standardized entries, while Σ is the covariance matrix of the features. Now Γ = diag(g_1, …, g_n) is the diagonal scaling matrix containing the scales of the samples. This model has a long history in multivariate statistical analysis (e.g., Mardia et al., 1979).

In this model, it turns out that the ARE can be expressed in a simple way via the η-transform (Tulino and Verdú, 2004). The η-transform of a distribution G is

 η_G(x) = E_G[1/(1 + xT)],  T ~ G,

for all x for which this expectation is well-defined. We will see that the ARE can be expressed in terms of the functional inverse of the η-transform evaluated at the specific value 1 − γ:

 f(γ, G) = η_G^{-1}(1 − γ).

For some insight on the behavior of η and f, consider first the case when G is a point mass at unity, G = δ_1. In this case, all scales are equal, so this is just the usual Marchenko-Pastur model. Then, we have η(x) = 1/(1 + x), while f(γ, δ_1) = γ/(1 − γ). See Figure 3 for the plots. The key points to notice are that η is a decreasing function of x, with η(0) = 1 and η(x) → 0 as x → ∞. Moreover, f is an increasing function on (0, 1) with f(0) = 0 and f(γ) → ∞ as γ → 1. The same qualitative properties hold in general for compactly supported scale distributions bounded away from 0.
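To make the η-transform concrete, here is a small numerical sketch: we represent G by a sample of scales, invert the (decreasing) η-transform by bisection, and check the point-mass identity f(γ, δ_1) = γ/(1 − γ):

```python
import numpy as np

def eta(x, scales):
    # eta-transform of the empirical distribution of `scales`: E[1/(1 + x*T)]
    return np.mean(1.0 / (1.0 + x * scales))

def f(gamma, scales, lo=0.0, hi=1e9):
    # f(gamma, G) = eta_G^{-1}(1 - gamma), computed by bisection (eta is decreasing)
    target = 1.0 - gamma
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if eta(mid, scales) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Point mass at 1: eta(x) = 1/(1+x), so f(gamma, delta_1) = gamma/(1-gamma)
val = f(0.2, np.ones(1))
print(val)  # close to 0.2/0.8 = 0.25
```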

In the elliptical model, we find the following expression for the ARE. [ARE for elliptical models] Consider the above high-dimensional asymptotic limit, where the data matrix X is random, and the samples have the form x_i = g_i^{1/2} Σ^{1/2} z_i. Suppose that, as n, p → ∞ with p/n → γ and p/n_i → γ_i, the e.s.d. of Γ converges weakly to G, the e.s.d. of each Γ_i converges weakly to some G_i, and the e.s.d. of Σ converges weakly to H. Suppose that H is compactly supported away from the origin, and that G is also compactly supported and does not have a point mass at the origin. Then, the ARE has the form

 ARE = f(γ, G) · Σ_{i=1}^k 1/f(γ_i, G_i).

See Section 9.7 for the proof.

There are two implicit relations in the above formula. First, Σ_{i=1}^k γ_i^{-1} = γ^{-1}, because Σ_{i=1}^k n_i = n. Second, G = Σ_{i=1}^k (γ/γ_i) G_i, or equivalently G = Σ_{i=1}^k (n_i/n) G_i, because Γ contains all entries of each Γ_i.

For the special case when all aspect ratios are equal, and all scale distributions are equal to , we can say more about the ARE. We have the following theorem.

[Properties of ARE for elliptical models] Consider the behavior of distributed regression in elliptical models under the conditions of Theorem 3. Suppose that the data sizes on all machines are equal, so that n_i = n/k for all i. Suppose moreover that the scale distributions on all machines are also equal, G_i = G. Then, the ARE has the following properties.

1. It can be expressed equivalently as

 ARE(k) = k · η_G^{-1}(1 − γ)/η_G^{-1}(1 − kγ) = k · f(γ, G)/f(kγ, G) = e(γ, G)/e(kγ, G).

Here η_G is the η-transform of G, f is defined above, while e = e(γ, G) is the unique positive solution of the equation

 ∫ s e/(1 + γ s e) dG(s) = 1.

2. Suppose also that G does not have a point mass at the origin. Then, the ARE is a strictly decreasing, smooth, convex function for k ∈ [1, 1/γ). Here k is viewed as a continuous variable. Moreover ARE(1) = 1, and

 lim_{k→1/γ} ARE(k) = 0.

See Section 9.8 for the proof. These theoretical formulas again match simulation results well; see Figure 4. In that figure, we use the same simulation setup as for Figure 2, and in addition we choose a uniform scale distribution.

The ARE for a constant scale distribution is a straight line in k, ARE(k) = (1 − kγ)/(1 − γ). For a general scale distribution, the graph of ARE(k) is a curve below that straight line. The interpretation is that for elliptical distributions, there is a larger efficiency loss in one-step averaging. Intuitively, the problem becomes "more non-orthogonal" due to the additional variability from sample to sample.
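We can sketch this numerically using the fixed-point characterization of e(γ, G) from the theorem, taking (as an arbitrary illustrative choice, since the interval used in the paper's figure is not reproduced here) a uniform scale distribution on [1, 2]; the elliptical ARE then falls strictly below the straight line (1 − kγ)/(1 − γ):

```python
import numpy as np

def e_fixed_point(gamma, scales, iters=2000):
    # Unique positive solution of E[ s*e / (1 + gamma*s*e) ] = 1,
    # computed by the fixed-point iteration e <- 1 / E[ s / (1 + gamma*s*e) ]
    e = 1.0
    for _ in range(iters):
        e = 1.0 / np.mean(scales / (1.0 + gamma * scales * e))
    return e

gamma = 0.1
scales = np.linspace(1, 2, 100001)  # grid approximation of Uniform[1, 2]

line = {k: (1 - k * gamma) / (1 - gamma) for k in (2, 3)}
are = {k: e_fixed_point(gamma, scales) / e_fixed_point(k * gamma, scales) for k in (2, 3)}
print(are, line)
```

For a point mass, e(γ, δ_1) = 1/(1 − γ), which recovers the straight line exactly; the spread of the uniform distribution is what pushes the ARE below it.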

#### 2.3.1 Worst-case analysis

It is natural to ask which elliptical distributions are difficult for distributed estimation. That is, for what scale distributions does the distributed setting have a strong effect on the learning accuracy? Intuitively, if some of the scales are much larger than others, then they "dominate" the problem, and may effectively reduce the sample size.

Here we show that this intuition is correct. We find a sequence of scale distributions such that distributed estimation is "arbitrarily bad", so that the ARE decreases very rapidly, and approaches zero even for two machines.

[Worst-case example] Consider elliptical models with scale distributions that are a mixture of small variances and much larger variances, with suitably chosen weights. Then, as the ratio of the large to the small variances grows, ARE(k) → 0. Therefore, the relative efficiency for any k ≥ 2 tends to zero.

See Section 9.9 for the proof.

Next, we consider more general scale distributions that are a mixture of small scales τ, and larger scales ατ, with arbitrary weights 1 − c and c:

 G = (1 − c) δ_τ + c δ_{ατ},

where c ∈ (0, 1) and α > 1. To gain some intuition about the setting where α → ∞, we notice that only the large scales contribute a non-negligible amount to the sample covariance matrix n^{-1}X^⊤X. Therefore, the sample size is effectively reduced by a factor equal to the fraction of large scales, which equals c. More specifically, we have

 n^{-1} X^⊤X = n^{-1} Σ_{j=1}^n x_j x_j^⊤ = α² n^{-1} Σ_{j≤cn} z_j z_j^⊤ + n^{-1} Σ_{j>cn} z_j z_j^⊤ ≈ α² n^{-1} Σ_{j≤cn} z_j z_j^⊤.

The last approximation follows because the sample covariance matrix has p eigenvalues, out of which we expect roughly min(cn, p) to be large, of the order of α². The remaining ones are smaller, of unit order. Thus, heuristically, this matrix is well approximated by a scaled sample covariance matrix of cn vectors. Therefore, the sample size is effectively reduced to the number cn of large scales. If cn < p, the matrix is nearly singular, while if cn > p, it is well-conditioned. This should provide some intuition for the results to follow.

[More general worst-case example] Consider elliptical models with scale distribution

 G = (1 − c) δ_τ + c δ_{ατ},

as in (2.3.1), where c ∈ (0, 1) and α > 1. When α tends to infinity, the ARE will depend on c and γ as summarized in Table 2.

See Section 9.10 for the proof.

As an example of special interest, if c = Mγ for some M > 1, then

 lim_{α→+∞} ARE(k) = (c − kγ)/(c − γ) for k < M, and lim_{α→+∞} ARE(k) = 0 for k > M.

As before, this result can be understood in terms of reducing the effective sample size to cn. When k < M, i.e., kγ < c, each local problem can be well-conditioned. However, when k > M, i.e., kγ > c, all local OLS problems are ill-conditioned, so the ARE is small.
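A small simulation illustrates the collapse (a sketch with illustrative sizes; the scales form a two-point mixture assigned at random, and we compare the finite-sample RE from Lemma 2.1 with the two-point limit above and with the non-elliptical line):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 2000, 100, 2            # gamma = p/n = 0.05, so M = c/gamma = 4 > k
c, big = 0.2, 2500.0              # fraction c of samples gets a much larger variance scale
scales = np.where(rng.random(n) < c, big, 1.0)
X = np.sqrt(scales)[:, None] * rng.standard_normal((n, p))

parts = np.array_split(np.arange(n), k)
tr_full = np.trace(np.linalg.inv(X.T @ X))
re = tr_full * sum(1 / np.trace(np.linalg.inv(X[idx].T @ X[idx])) for idx in parts)

gamma = p / n
limit = (c - k * gamma) / (c - gamma)     # two-point limit for k < c/gamma
mp_line = (n - k * p) / (n - p)           # non-elliptical ARE, for comparison
print(re, limit, mp_line)
```

The finite-sample RE lands near the two-point limit (about 2/3 here) and far below the non-elliptical value (about 0.95), consistent with the effective sample size cn.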

## 3 A general framework

After our study of estimation, we now introduce a more general framework, which allows us to study the behavior of many learning tasks in a unified way, including prediction error, confidence intervals (i.e., statistical inference), and regression function estimation. In the general framework, we predict linear functionals of the regression coefficients of the form

 L_A = Aβ + Z.

Here A is a fixed d × p matrix, and Z is a zero-mean Gaussian noise vector of dimension d, with covariance matrix hσ² I_d, for some scalar parameter h ≥ 0. We denote the covariance matrix between the regression noise ε and Z by σ²N, so that Cov(ε, Z) = σ²N. If h = 0, we say that there is no noise. In that case, we necessarily have N = 0.

We predict the linear functional via plug-in based on some estimator β̂_0 (typically OLS or distributed OLS):

 L̂_A(β̂_0) = A β̂_0.

We measure the quality of estimation by the mean squared error

 M(β̂_0) = E‖L_A − L̂_A(β̂_0)‖².

We then compute the relative efficiency of a weighted distributed estimator compared to OLS.

### 3.1 Examples

We now show how several learning and inference problems fall into the general framework. See Table 3 for a concise summary. In addition to parameter estimation, we will discuss out-of-sample prediction (test error), in-sample prediction (training error), and confidence intervals.

Estimation. In parameter estimation, we want to estimate the regression coefficient vector β using β̂. This is an example of the general framework where the transform matrix is A = I_p, and there is no noise (so that h = 0 and N = 0).

Regression function estimation. We can use Xβ̂ to estimate the regression function Xβ. In this case, the transform matrix is A = X, the linear functional is L_A = Xβ, the predictor is L̂_A = Xβ̂, and there is no noise.

Out-of-sample prediction (Test error). For out-of-sample prediction, or test error, we consider a test data point (x_t, y_t), generated from the same model y_t = x_t^⊤β + ε_t, where x_t, ε_t are independent of the training data, and only x_t is observable. We want to use x_t^⊤β̂ to predict y_t.

This corresponds to predicting the linear functional L_A = x_t^⊤β + ε_t, so that A = x_t^⊤, and the noise is Z = ε_t, which is uncorrelated with the noise in the original problem.

In-sample prediction (Training error). For in-sample prediction, or training error, we consider predicting the response vector Y, using the model fit Xβ̂. Therefore, the functional is L_A = Xβ + ε. This agrees with regression function estimation, except for the noise Z = ε, which is identical to the original noise. Hence, the noise scale is h = 1, and N = I_n.

Confidence intervals. To construct confidence intervals for individual coordinates, we consider the normal model $Y = X\beta + \varepsilon$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. Assuming $\sigma$ is known, a confidence interval with coverage $1-\alpha$ for a given coordinate $\beta_j$ is

 \hat{\beta}_j \pm \sigma z_{\alpha/2} V_j^{1/2},

where $z_{\alpha/2} = \Phi^{-1}(1-\alpha/2)$, with $\Phi^{-1}$ the inverse normal CDF, and $V_j$ is the $j$-th diagonal entry of $(X^\top X)^{-1}$.

Therefore, we can measure the difficulty of the problem by $V_j$. The larger $V_j$ is, the longer the confidence interval. This corresponds to measuring the difficulty of estimating the coordinate $\beta_j$. This fits in our general framework by choosing $A = e_j^\top$, where $e_j$ is the vector of zeros with only a one in the $j$-th coordinate. This problem is noiseless.
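As an illustrative sketch (not part of the formal development), the interval half-length $\sigma z_{\alpha/2} V_j^{1/2}$ can be computed directly on simulated Gaussian data; all variable names below are our own, and we take identity covariance and a known $\sigma$:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(0)
n, p, sigma, alpha = 100, 5, 1.0, 0.05
X = rng.standard_normal((n, p))

# V_j is the j-th diagonal entry of (X^T X)^{-1}
V = np.diag(np.linalg.inv(X.T @ X))

# normal quantile z_{alpha/2} = Phi^{-1}(1 - alpha/2)
z = NormalDist().inv_cdf(1 - alpha / 2)

# half-length of the confidence interval for each coordinate beta_j
half_length = sigma * z * np.sqrt(V)
```

Longer intervals correspond to larger $V_j$, matching the difficulty measure above.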

### 3.2 Finite sample results

We now show how to calculate the efficiency explicitly in the general framework. We start with the simpler case of no noise. We then have, for the OLS estimator,

 M(\hat\beta) = \sigma^2 \cdot \tr\left[ (X^\top X)^{-1} A^\top A \right].

For the distributed estimator with weights $w_i$ summing to one, given by $\hat\beta_{\mathrm{dist}} = \sum_{i=1}^k w_i \hat\beta_i$, we have

Following the same steps as before, we find that the optimal efficiency is

 E(A; X_1, \dots, X_k) = \frac{\tr\left[(X^\top X)^{-1} A^\top A\right]}{\left( \sum_{i=1}^k \tr\left[(X_i^\top X_i)^{-1} A^\top A\right]^{-1} \right)^{-1}} = \tr\left[(X^\top X)^{-1} A^\top A\right] \cdot \sum_{i=1}^k \frac{1}{\tr\left[(X_i^\top X_i)^{-1} A^\top A\right]}. \qquad (1)
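The efficiency in equation (1) is straightforward to evaluate on any concrete data split. A minimal sketch for parameter estimation ($A = I$) with a Gaussian design partitioned evenly over $k$ machines (variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 300, 10, 3
X = rng.standard_normal((n, p))
A = np.eye(p)                  # parameter estimation corresponds to A = I
splits = np.array_split(X, k)  # even partition of the rows over k machines

def tr_func(M):
    # the trace functional tr[(M^T M)^{-1} A^T A]
    return np.trace(np.linalg.inv(M.T @ M) @ A.T @ A)

# relative efficiency from equation (1)
eff = tr_func(X) * sum(1.0 / tr_func(Xi) for Xi in splits)
```

By the concavity argument given next, the computed efficiency never exceeds unity.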

This shows that the key to understanding the efficiency is the trace functionals $\tr[(X_i^\top X_i)^{-1} A^\top A]$.

By following Proposition 2.1 with minor modifications, we can prove that the efficiency is at most unity.

[Concavity for general efficiency] The map $S \mapsto 1/\tr(S^{-1} A^\top A)$ is a concave function defined on positive definite matrices. As a consequence, the general relative efficiency for distributed estimation is at most unity for any matrices $X_1, \dots, X_k$:

 E(A; X_1, \dots, X_k) \le 1.

See Section 9.11 for the proof.

For the more general case when the noise is nonzero, we can also find the OLS MSE as

 M(\hat\beta) = \sigma^2 \cdot \left[ \tr\left((X^\top X)^{-1} A^\top A\right) - 2\,\tr\left(A (X^\top X)^{-1} X^\top N\right) + hd \right].

For the distributed estimator, we can find, denoting ,

Let , and . The optimal weights can be found from a quadratic optimization problem:

 w_i = \frac{\lambda^* + b_i}{a_i}, \qquad \lambda^* := \frac{1 - \sum_{i=1}^k b_i / a_i}{\sum_{i=1}^k 1/a_i}.
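One can check directly that these weights satisfy the constraint that they sum to one: summing $w_i = (\lambda^* + b_i)/a_i$ over $i$ gives $\lambda^* \sum_i 1/a_i + \sum_i b_i/a_i = 1$ by the definition of $\lambda^*$. A small sketch with hypothetical values for the per-machine quantities $a_i$ and $b_i$:

```python
import numpy as np

# hypothetical per-machine quantities: a_i > 0 and b_i (placeholders, not
# derived from data here)
a = np.array([2.0, 3.0, 5.0])
b = np.array([0.3, 0.1, 0.2])

lam = (1.0 - np.sum(b / a)) / np.sum(1.0 / a)  # lambda^*
w = (lam + b) / a                              # optimal weights w_i

# the constraint sum_i w_i = 1 holds by construction
assert abs(w.sum() - 1.0) < 1e-12
```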

The resulting formula for the optimal weights, and for the global optimum, can be calculated explicitly. However, we have not found the result particularly insightful, so we do not report it here. The details can be found in Section 9.12.

## 4 Some related work

In this section we discuss some related work. There is a great deal of work in computer science and optimization on parallel and distributed computation (see e.g., Bertsekas and Tsitsiklis, 1989; Lynch, 1996; Blelloch and Maggs, 2010; Boyd et al., 2011; Rauber and Rünger, 2013; Koutris et al., 2018). These areas are quite mature, with well-developed algorithmic frameworks, such as the PRAM (or parallel random access memory) model, which allows precise analyses of the computational resources involved in distributed computation.

In addition, there are several popular examples of distributed computing environments, some of them specifically designed for machine learning and statistics: for instance MapReduce (Dean and Ghemawat, 2008), Hadoop (White, 2012), and Spark (Zaharia et al., 2010).

In contrast, there is less work on understanding the statistical properties, and the inherent computation-statistics tradeoffs, in distributed computation environments. This area has attracted increasing attention only in recent years, see for instance Mcdonald et al. (2009); Zhang et al. (2012, 2013b, 2013a); Duchi et al. (2014); Zhang et al. (2015); Braverman et al. (2016); Jordan et al. (2016); Rosenblatt and Nadler (2016); Smith et al. (2016); Fan et al. (2017); Lin et al. (2017); Lee et al. (2017); Battey et al. (2018); Zhu and Lafferty (2018), and the references therein. See (Huo and Cao, ) for a review. We can only discuss the most closely related papers due to space limitations.

Mcdonald et al. (2009) show finite-sample bounds on the expected error of averaged estimators in multinomial regression. Zinkevich et al. (2010) propose parallel stochastic gradient descent algorithms, deriving the speed of convergence of parameter distributions to their asymptotic limits.

Zhang et al. (2013b) bound the leading-order term of the MSE for averaged estimation in empirical risk minimization. Their bounds do not explicitly take the dimension into account. However, their empirical data example, a logistic regression, clearly has a dimension comparable to the sample size. In their experiments, they distribute the data over up to 128 machines, so our regime, where the dimension is of the same order as the sample size, matches their simulation setup well. In addition, their concern is with regularized estimators, for which they propose to estimate and reduce the bias by subsampling.

Liu and Ihler (2014) study distributed estimation in statistical exponential families, showing that the efficiency loss compared to the global setting relates to how much the underlying distributions deviate from full exponential families. They also propose nonlinear KL-divergence-based combination methods, which can be more efficient than linear averaging.

Zhang et al. (2015) study divide-and-conquer kernel ridge regression, showing that the partition-based estimator achieves the statistical minimax rate over all estimators. Due to their generality, their results are more involved, and their dimension is also fixed.

Lin et al. (2017) improve those results. Duchi et al. (2014) derive minimax bounds on distributed estimation where the number of bits communicated is controlled.

Rosenblatt and Nadler (2016) consider the distributed learning problem in three different settings. The first two settings are fixed-dimensional. The third setting is high-dimensional M-estimation, where they study the first-order behavior of estimators using prior results from Donoho and Montanari (2013); El Karoui et al. (2013). This is possibly the most closely related work to ours in the literature. They use a representation, derived in the works mentioned above, under which a high-dimensional M-estimator is asymptotically Gaussian around the truth, with a variance constant depending on the loss function, whose expression can be found in Donoho and Montanari (2013); El Karoui et al. (2013).

They derive a relative efficiency formula in this setting, which for OLS takes the form

In contrast, our result for this case is equal to

 \frac{1 - \gamma}{1 - k\gamma} = 1 + \frac{(k-1)\gamma}{1 - k\gamma}.
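This identity is elementary algebra over the common denominator $1 - k\gamma$, valid whenever $k\gamma < 1$; a quick numerical check:

```python
# check (1 - g)/(1 - k*g) == 1 + (k - 1)*g/(1 - k*g) whenever k*g < 1
for k in (2, 5, 10):
    for g in (0.01, 0.05, 0.09):
        lhs = (1 - g) / (1 - k * g)
        rhs = 1 + (k - 1) * g / (1 - k * g)
        assert abs(lhs - rhs) < 1e-12
```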

Thus, our result is much more precise, and in fact exact, while of course being limited to the special case of linear regression.

Lee et al. (2017) study sparse linear regression, showing that averaging debiased lasso estimators can achieve the optimal estimation rate if the number of machines is not too large. Battey et al. (2018) study a similar problem, also including hypothesis testing under more general sparse models.

## 5 Calculus of deterministic equivalents

### 5.1 A calculus of deterministic equivalents in RMT

In the previous section we saw that the relative efficiency depends on the trace functionals $\tr[(X_i^\top X_i)^{-1} A^\top A]$, for specific matrices $A$. To find the limits of these functionals, we will use the technique of deterministic equivalents from random matrix theory. This is a method to find the almost sure limits of random quantities. See for example Hachem et al. (2007); Couillet et al. (2011) and the related work section below.

For instance, the Marchenko-Pastur (MP) law itself states that the eigenvalue distribution of certain random matrices is asymptotically deterministic. More generally, one of the best ways to understand the MP law is that resolvents are asymptotically deterministic. Indeed, let $\hat\Sigma = n^{-1} X^\top X$, where $X = Z\Sigma^{1/2}$ and $Z$ has iid entries. Then the MP law means that for any $z$ with positive imaginary part, we have the equivalence

 (\hat\Sigma - zI)^{-1} \asymp (x_p \Sigma - zI)^{-1},

for a certain scalar $x_p$. At this stage we can think of the equivalence entry-wise, but we will make this precise next. The above formulation has appeared in some early works by V. I. Serdobolskii, see e.g., Serdobolskii (1983), and Theorem 1 on page 15 of Serdobolskii (2007) for a very clear statement.

The consequence is that simple linear functionals of the random matrix have a deterministic equivalent based on . In particular, we may be able to express the trace functionals we need in a simpler deterministic way. In order to do this, we will take a principled approach and define some appropriate notions for a calculus of deterministic equivalents, which allows us to do calculations in a simple and effective way.

First, we make the notion of equivalence more precise. We say that the (deterministic or random) symmetric matrix sequences $A_n, B_n$ of growing dimensions are equivalent, and write

 A_n \asymp B_n

if

 \lim_{n \to \infty} \left| \tr\left[ C_n (A_n - B_n) \right] \right| = 0

almost surely, for any sequence $C_n$ of symmetric deterministic matrices with bounded trace norm, i.e., such that

 \limsup_{n \to \infty} \|C_n\|_{\mathrm{tr}} < \infty.

We call such a sequence $C_n$ a standard sequence. Recall here that the trace norm (or nuclear norm) is defined by $\|C\|_{\mathrm{tr}} = \sum_i \sigma_i(C)$, where $\sigma_i(C)$ are the singular values of $C$. Since the matrices considered here are real symmetric, we also have that $\|C\|_{\mathrm{tr}} = \sum_i |\lambda_i(C)|$, where $\lambda_i(C)$ are the eigenvalues of $C$.
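The equality between the two expressions for the trace norm of a real symmetric matrix is easy to confirm numerically; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 6))
S = (M + M.T) / 2                          # a real symmetric matrix

sing = np.linalg.svd(S, compute_uv=False)  # singular values of S
eig = np.linalg.eigvalsh(S)                # (real) eigenvalues of S

# for real symmetric S, the singular values are |eigenvalues|,
# so both sums give the trace norm
trace_norm = sing.sum()
```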

### 5.2 General MP theorem

To find the limits of the efficiencies, the most important deterministic equivalent will be the following general Marchenko-Pastur theorem. See Section 9.13 for the proof, which relies on the generalized Marchenko-Pastur theorem of Rubio and Mestre (2011).

[Deterministic equivalent in elliptical models, consequence of Rubio and Mestre (2011)] Let the data matrix follow the elliptical model

 X = \Gamma^{1/2} Z \Sigma^{1/2},

where $\Gamma$ is an $n \times n$ diagonal matrix with non-negative entries representing the scales of the observations, and $\Sigma$ is a $p \times p$ positive definite matrix representing the covariance matrix of the features. Assume the following:

1. The entries of $Z$ are iid random variables with mean zero, unit variance, and a finite moment of some sufficiently high fixed order.

2. The eigenvalues of $\Sigma$, and the entries of $\Gamma$, are uniformly bounded away from zero and infinity.

3. We have $n, p \to \infty$, with the aspect ratio $\gamma_p = p/n$ bounded away from zero and infinity.

Let $\hat\Sigma = n^{-1} X^\top X$ be the sample covariance matrix. Then $\hat\Sigma^{-1}$ is equivalent to a scaled version of the inverse population covariance

 \hat\Sigma^{-1} \asymp \Sigma^{-1} \cdot e_p.

Here $e_p$ is the unique solution of the fixed-point equation

 1 = \frac{1}{n} \tr\left[ e_p \Gamma \left( I + \gamma_p e_p \Gamma \right)^{-1} \right].

Thus, the inverse sample covariance has a deterministic equivalent in terms of a scaled version of the inverse population covariance. This result does not require the convergence of the aspect ratio $\gamma_p$, or of the e.s.d. of $\Gamma$. However, if the empirical spectral distribution of $\Gamma$ tends to a limit $G$, the above equation has a limit which agrees with the previous equation, namely

 \int \frac{s e}{1 + \gamma s e} \, dG(s) = 1.

The usual MP theorem is a special case of the above result where $\Gamma = I_n$. As a result, we obtain the following corollary:

[Deterministic equivalent in MP models] Let the data matrix follow the model $X = Z\Sigma^{1/2}$, where $\Sigma$ is a positive definite matrix representing the covariance matrix of the features. Assume the same conditions on $Z$ from Theorem 5.2. Then $\hat\Sigma^{-1}$ is equivalent to a scaled version of the inverse population covariance

 \hat\Sigma^{-1} \asymp \frac{1}{1 - \gamma_p} \cdot \Sigma^{-1}.

The proof is immediate, by checking that $e_p = 1/(1 - \gamma_p)$ in this case.
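The corollary can be illustrated by simulation. For $\Sigma = I$, the equivalence suggests $p^{-1}\tr(\hat\Sigma^{-1}) \approx 1/(1-\gamma_p)$; a sketch (sample sizes chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 2000, 200
gamma = p / n                    # aspect ratio gamma_p
X = rng.standard_normal((n, p))  # Sigma = I in this sketch
Sigma_hat = X.T @ X / n          # sample covariance matrix

# normalized trace of the inverse sample covariance vs. the
# predicted scaling 1/(1 - gamma_p)
lhs = np.trace(np.linalg.inv(Sigma_hat)) / p
rhs = 1.0 / (1.0 - gamma)
```

For these dimensions the two quantities agree to within a few parts in a thousand.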

#### 5.2.1 Related work on deterministic equivalents

There are several works in random matrix theory on deterministic equivalents. One of the early works is Serdobolskii (1983); see Serdobolskii (2007) for a modern summary. The name "deterministic equivalents", and the associated technique, were more recently introduced and re-popularized by Hachem et al. (2007) for signal-plus-noise matrices. Later, Couillet et al. (2011) developed deterministic equivalents for certain matrix models motivated by wireless communications. See the book Couillet and Debbah (2011) for a summary of related work. See also Müller and Debbah (2016) for a tutorial. However, many of these results are stated only for some fixed functional of the resolvent, such as the Stieltjes transform. One of our points here is that there is a much more general picture.

Rubio and Mestre (2011) is one of the few works that explicitly states more general convergence of arbitrary trace functionals of the resolvent. Our results are essentially a consequence of theirs.

However, we think that it is valuable to define a set of rules, a "calculus", for working with deterministic equivalents, and we use these techniques in our paper. Similar ideas for operations on deterministic equivalents have appeared in Peacock et al. (2008), for the specific case of a matrix product. Our approach is more general, and allows many more matrix operations; see below.

### 5.3 Rules of calculus

The calculus of deterministic equivalents has several properties that simplify calculations. We think these justify the name of calculus. Below, we will denote by $A_n, B_n, C_n$, etc., sequences of deterministic or random matrices. See Section 9.14 for the proof.

[Rules of calculus] The calculus of deterministic equivalents has the following properties.

1. Equivalence. The relation $\asymp$ is indeed an equivalence relation.

2. Sum. If $A_n \asymp B_n$ and $C_n \asymp D_n$, then $A_n + C_n \asymp B_n + D_n$.

3. Product. If $C_n$ is a sequence of matrices with bounded operator norms, i.e., $\limsup_n \|C_n\| < \infty$, and $A_n \asymp B_n$, then $C_n A_n \asymp C_n B_n$.

4. Trace. If $A_n \asymp B_n$, then $n^{-1} \tr A_n - n^{-1} \tr B_n \to 0$ almost surely.

5. Stieltjes transforms. As a consequence, if $(A_n - zI)^{-1} \asymp (B_n - zI)^{-1}$, then $m_{A_n}(z) - m_{B_n}(z) \to 0$ almost surely. Here $m_{A_n}(z) = n^{-1}\tr[(A_n - zI)^{-1}]$ is the Stieltjes transform of the empirical spectral distribution of $A_n$.

In addition, we note that the calculus of deterministic equivalents has additional properties, such as continuous mapping theorems, differentiability, etc. However, we do not need those explicitly in the current paper, so we leave them for future work.

## 6 Examples

We now show how to use the calculus of deterministic equivalents to find the limits of the trace functionals in our general framework. We study each problem in turn.

### 6.1 Regression function estimation

For estimating the regression function, we have $A = X$, so that $A^\top A = X^\top X$. We then find via equation (1) the prediction efficiency

 FE(X_1, \dots, X_k) = \sum_{i=1}^k \frac{p}{\tr\left( (X_i^\top X_i)^{-1} X^\top X \right)}.
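As a numerical illustration (variable names are our own), this quantity can be evaluated on simulated Gaussian data split evenly over $k$ machines; by the concavity result of Section 3.2 it is at most one:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, k = 600, 20, 3
X = rng.standard_normal((n, p))
splits = np.array_split(X, k)   # even partition of the rows over k machines

# prediction efficiency: sum_i p / tr[(X_i^T X_i)^{-1} X^T X]
G = X.T @ X
fe = sum(p / np.trace(np.linalg.inv(Xi.T @ Xi) @ G) for Xi in splits)
```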

For asymptotics, we consider as before elliptical models.

[FE for elliptical and MP models] Under the conditions of Theorems 5.2 and 3, the FE has the almost sure limit

 FE(X1,…,Xk)→a.s.k∑i=111+(1γ\EGT−1γi\EGiT)f(γi,Gi).

Under the conditions of Corollary 5.2, the FE has the almost sure limit . So for Marchenko-Pastur models, the limit is the same as for parameter estimation from Theorem 2.2.

See Section 9.15 for the proof. This efficiency is more complex than the one for estimation error. In particular, the dependence is not simply linear, but rather given by a ratio of two linear functions. However, it can be checked that many of the properties (e.g., monotonicity) of the ARE still hold here.

### 6.2 In-sample prediction (Training error)

For in-sample prediction, we start with the well known formula

 \E \left\| X(\beta - \hat\beta) + \varepsilon \right\|^2 = \sigma^2 \left[ n - \tr\left( (X^\top X)^{-1} X^\top X \right) \right] = \sigma^2 (n - p).
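This identity can be confirmed by simulation, averaging the residual sum of squares over independent noise draws (a sketch with illustrative dimensions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma, reps = 200, 20, 1.0, 300
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix

rss = []
for _ in range(reps):
    eps = sigma * rng.standard_normal(n)
    y = X @ beta + eps
    resid = y - H @ y                  # equals (I - H) eps
    rss.append(resid @ resid)

# the average residual sum of squares should be close to sigma^2 (n - p)
mean_rss = float(np.mean(rss))
```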

As we saw, to fit in-sample prediction in the general framework, we need to take the transform matrix $A = X$, the noise equal to the original noise $\varepsilon$, and the covariance matrices