On the Optimality of Averaging in Distributed Statistical Learning

07/10/2014 · Jonathan Rosenblatt et al. · Weizmann Institute of Science

A common approach to statistical learning with big-data is to randomly split it among m machines and learn the parameter of interest by averaging the m individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: First, a classical setting where the number of parameters p is fixed, and the number of samples per machine n→∞. Second, a high-dimensional regime where both p,n→∞ with p/n →κ∈ (0,1). For both regimes and under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-p setting, under suitable assumptions, we prove that to leading order averaging is as accurate as the centralized solution. We also derive the second order error terms, and show that these can be non-negligible, notably for non-linear models. The high-dimensional setting, in contrast, exhibits a qualitatively different behavior: data splitting incurs a first-order accuracy loss, which to leading order increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.


1 Introduction

The Big-data era, characterized by huge datasets and an appetite for new scientific and business insights, often involves learning statistical models of great complexity. Typically, the storage and analysis of such data cannot be performed on a single machine. Several platforms such as Map-Reduce (Dean and Ghemawat, 2008), Hadoop (Shvachko et al., 2010), and Spark (Zaharia et al., 2010) have thus become standards for distributed learning with big-data.

These platforms allow learning in an “embarrassingly parallel” scheme, whereby a large dataset with N observations is split among m machines, each having access to only a subset of n = N/m samples. Approaches to embarrassingly-parallel learning can roughly be categorized by the output of each machine: predictions, parameters, or gradients. In this paper we consider the second, whereby each of the m individual machines fits a model with p parameters and transmits them to a central node for merging. This split-and-merge strategy, advocated by Mcdonald et al. (2009) for striking the best balance between accuracy and communication, is both simple to program and communication efficient: only a single round of communication is performed, and only to a central node. It is restrictive in that machines do not communicate between themselves, and splitting is done only along observations and not along variables. For an overview of more general distributed learning strategies see, for example, Bekkerman et al. (2011).

Our focus is on the statistical properties of this split-and-merge approach, under the assumption that the data are split uniformly at random among the machines. In particular, we study the simplest merging strategy, of averaging the individual machine estimates, denoted as the Mixture Weight Method in Mcdonald et al. (2009). In this context we ask the following questions: (i) what is the estimation error of simple averaging as compared to a centralized solution? (ii) what is its distribution? (iii) under which criteria, if any, is averaging optimal? and (iv) how many machines to deploy?

Mcdonald et al. were among the first to study some of these issues for multinomial regression (a.k.a. Conditional Maximum Entropy), deriving finite sample bounds on the expected error of the averaged estimator (Mcdonald et al., 2009, Theorem 3). In a follow-up work, Zinkevich et al. (2010) compared the statistical properties of the averaged estimator to the centralized one for more general learning tasks, assuming each machine estimates the model parameters by stochastic gradient descent. More recently, under appropriate conditions and for a large class of loss functions, Zhang et al. (2013b, Theorem 1) derived bounds for the leading order term in the mean squared error (MSE) of the averaged estimator and provided the rates of higher order terms. They further proposed several improvements to the simple averaging strategy that reduce the second order term in the MSE, and reduce the machine-wise run time via modified optimization algorithms.

In this paper we extend and generalize these previous works in several aspects. First, in Section 3 we study the statistical properties of the averaged estimator when the number of parameters p is fixed, under conditions similar to those of Zhang et al. (2013b). Using the classical statistical theory of M-estimators (Vaart, 1998; Rieder, 2012), we provide not only asymptotic bounds on the MSE, but rather an asymptotic expansion of the error itself. This allows us to derive the exact constants in the MSE expansion, and prove that as n → ∞, the MSE of the averaging strategy in fact equals that of the centralized solution. Put differently, for various learning tasks, when the number of machines and their available memory are such that in each machine there are many observations per parameter (n ≫ p), averaging machine-wise estimates is as accurate as the centralized solution. Furthermore, if the centralized solution enjoys first-order statistical properties such as efficiency or robustness, then so will the parallelized solution. We remark that for maximum likelihood problems, independently of our work, the asymptotic agreement between centralized and averaged estimators was also noted by Liu and Ihler (2014). The asymptotic representation of the averaged estimator also readily yields its limiting distribution. This allows one to construct confidence intervals, perform hypothesis tests on the unknown parameters, and select features, without the need for computationally intensive procedures such as bootstrapping.

The first-order equivalence between the averaged and centralized estimators may seem as a free lunch: run-time speedups with no accuracy loss. Distributed estimation via split-and-average, however, does incur an accuracy loss captured in the higher order error terms. The classical theory of M-estimators permits the derivation of these terms, in principle up to an arbitrary order. We do so explicitly up to second order, revealing the accuracy loss of split-and-average schemes.

In Section 4 we consider the statistical effects of data splitting in a high-dimensional regime, where the model dimension p grows with the number of observations n: p/n → κ with κ ∈ (0, 1). Our motivation comes from modern day data analysis practices, where increasingly complicated models are considered as more data is made available. This is a challenging regime in that machine-wise estimates are typically not only inconsistent, but in fact do not even converge to deterministic quantities. Here, in the absence of a general theory of M-estimators, we restrict our analysis to generative linear models. In contrast to the fixed-p setting, in this high-dimensional regime there is a first order accuracy loss due to the split data, which increases (approximately) linearly with the number of machines. Luckily, in several practical situations, this accuracy loss is moderate. Our analysis builds upon the recent results of El Karoui et al. (2013) and Donoho and Montanari (2013), and to the best of our knowledge, is the first to study the accuracy loss of parallelization in this high-dimensional regime.

In Section 5 we present several simulations, both in the fixed-p and in the high-dimensional regime, that illustrate the utility but also the limitations of our results. These confirm that when learning linear models with abundant data, random splitting and averaging is attractive both computationally and statistically. In contrast, for non-linear models, the accuracy loss due to data splitting can be considerable.

In Section 6 we attend to practical considerations, such as parallelization vs. sub-sampling and the choice of the number of machines, m. For the latter, we distinguish between parallelization due to memory constraints and parallelization motivated by run-time speedups. For these two scenarios we formulate the choice of m as an optimization problem constrained on the desired error level. Interestingly, when motivated by run-time speedups, using our approximations for the estimation error and varying m traces the accuracy-complexity tradeoff facing the practitioner.

We conclude with a discussion and several further insights in Section 7. All proofs appear in the appendices.

2 Problem Setup

We consider the following general statistical learning setup: Let Z be a random variable defined on an instance space 𝒵 and having an unknown density. Also, let the parameter space Θ be an open convex subset of ℝ^p, and let f(z, θ) denote a loss function. Our interest is to estimate the p-dimensional parameter θ* that minimizes the population risk

R(θ) = E_Z[ f(Z, θ) ],    θ* = argmin_{θ ∈ Θ} R(θ).    (1)

In the following, we assume that θ* exists in Θ and is unique. Given n i.i.d. samples z_1, …, z_n of the r.v. Z, a standard approach, known as M-estimation or empirical risk minimization (ERM), is to calculate the estimator θ̂ that minimizes the empirical risk

R̂(θ) = (1/n) Σ_{i=1}^{n} f(z_i, θ),    θ̂ = argmin_{θ ∈ Θ} R̂(θ).    (2)

This framework covers many common unsupervised and supervised learning tasks. In the latter, Z = (X, Y) consists of both features X and labels Y. There is by now an established theory providing conditions for θ̂ to be a consistent estimator of θ*, and non-asymptotic bounds on its finite sample deviation from θ* (see Devroye et al. (1997); Shalev-Shwartz and Ben-David (2014) and references therein).

In this paper we consider a big-data setting, whereby the number of samples N is so large that instead of minimizing Eq.(2) on a single machine, the data is randomly allocated among m machines, each having access to only a subset of size n = N/m. In line with the Map-Reduce workflow, a typical approach in this distributed scenario is that each machine computes its own M-estimator and transmits it to a central node for further processing. In this work we focus on the most common aggregation procedure, namely simple averaging,

θ̄ = (1/m) Σ_{j=1}^{m} θ̂_j,    (3)

where θ̂_j denotes the minimizer of Eq.(2) computed by the j-th machine over its own n observations.
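For concreteness, the following minimal R sketch (ours, not part of the paper) implements the split-and-average scheme of Eq.(3) for logistic regression; the data-generating mechanism and the use of glm() as the machine-wise M-estimator are illustrative choices.

# Minimal sketch of split-and-average (Eq. 3) for logistic regression.
# The data generation and the choice of glm() are illustrative assumptions.
set.seed(1)
N <- 1e5; p <- 10; m <- 20; n <- N / m
theta.star <- rep(1, p) / sqrt(p)
X <- matrix(rnorm(N * p), N, p)
y <- rbinom(N, 1, plogis(drop(X %*% theta.star)))

machine.of <- sample(rep(1:m, each = n))            # uniform random split
theta.hat <- sapply(1:m, function(j) {              # machine-wise M-estimation
  idx <- which(machine.of == j)
  coef(glm(y[idx] ~ X[idx, ] - 1, family = binomial()))
})
theta.bar <- rowMeans(theta.hat)                    # simple averaging (Eq. 3)

# Compare with the centralized estimator fitted on all N samples
theta.central <- coef(glm(y ~ X - 1, family = binomial()))
c(parallel = sqrt(sum((theta.bar - theta.star)^2)),
  central  = sqrt(sum((theta.central - theta.star)^2)))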

Our questions of interest are: (i) what is the accuracy of the averaged estimator θ̄ vs. that of the centralized estimator θ̂_N, computed on all N samples? (ii) what are the statistical properties of θ̄? (iii) under which criteria, if any, is θ̄ optimal? and (iv) how many machines to deploy?

3 Fixed-p Setting

First, we consider the error of the split-and-average estimator θ̄ of Eq.(3) when data is abundant and the model dimension p and the number of machines m are both fixed. In this setting, bounds on the MSE were derived by both Zhang et al. (2013b) and Mcdonald et al. (2009). For the particular case of maximum likelihood estimation, Liu and Ihler (2014, Theorem 4.6) derived the exact asymptotic expression of the first two leading error terms in the MSE, as the machine-wise sample size n → ∞. We take a similar approach but for more general M-estimators. Instead of focusing on the MSE, we derive an exact asymptotic representation of the first two terms in the error itself.

3.1 First Order Statistical Properties of Averaging

We start by analyzing the exact asymptotic expression for the dominant error term. We make the following standard assumptions (Vaart, 1998, Theorem 5.23), similar to those made in Zhang et al. (2013b):

Assumption Set 1.

  1. The estimator is consistent: θ̂ →_P θ*.

  2. The population risk R(θ) admits a second order Taylor expansion at θ* with non-singular Hessian V = ∇²R(θ*).

  3. The loss f(z, θ) is differentiable at θ* almost surely (a.s.) or in probability.

  4. The loss f(z, θ) is Lipschitz near θ*: |f(z, θ₁) − f(z, θ₂)| ≤ M(z) ‖θ₁ − θ₂‖, with the Lipschitz coefficient bounded in squared expectation, E[M²(Z)] < ∞.

Our first result, formally stated in the following theorem, is that under Assumption Set 1 averaging machine-wise estimates enjoys the same first-order statistical properties as the centralized solution.

Theorem 1.

Under Assumption Set 1, as n → ∞ with m fixed, the averaged estimator admits the expansion

θ̄ − θ* = −(1/N) Σ_{i=1}^{N} V^{-1} ∇_θ f(z_i, θ*) + o_P(N^{-1/2}),    N = nm,    (4)

where the o_P term holds in any norm.

We say that two estimators are first-order equivalent if their leading error terms converge to the same limit at the same rate, with the same limiting distribution. Assumption Set 1 implies that the centralized estimator θ̂_N converges to θ* at rate N^{-1/2} (Vaart, 1998, Corollary 5.53). Theorem 1 thus directly implies the following:

Corollary 1.

The averaged estimator θ̄ is first-order equivalent to the centralized solution θ̂_N.

Remark 1.

In practice, Eq.(2) is minimized only approximately, typically by some iterative scheme such as gradient descent (GD), stochastic gradient descent (SGD), etc. An important point is that Theorem 1 holds not only for the exact empirical minimizer of Eq.(2), but also for any approximate minimizer, as long as it minimizes the empirical risk up to an o_P(n^{-1}) term (Vaart, 1998, Theorem 5.23). In other words, for Corollary 1 to hold, it suffices to minimize the empirical risk up to o_P(n^{-1}) precision.

Theorem 1 has important implications for the statistical properties of θ̄, its optimality, and its robustness. We discuss these in detail below; but first, let us describe the scope covered by this theorem.

Scope

As detailed further in Appendix H, the learning tasks covered by Theorem 1 are quite broad, and include: linear or non-linear regression with squared, Huber, or log likelihood loss; linear or non-linear quantile regression with continuous predictors; binary regression where the class probability is a smooth function of the linear predictor, with log likelihood or Huberized hinge loss (a smooth version of the Huber loss; Rosset and Zhu, 2007); binary hinge loss regression (i.e. SVM) with continuous predictors; and unsupervised learning of location and scale. Furthermore, Theorem 1 also covers regularized risk minimization with a fixed regularization term added to the empirical risk, provided that the modified loss function satisfies the required assumptions.

Some learning problems, however, are not covered by Theorem 1. Examples include: non-uniform allocation of samples to machines; non-convex parameter spaces; a data-driven regularization term; and a non-differentiable loss with discrete predictors. Also not covered is the regime studied by Shamir et al. (2013), who showed that averaging (denoted there as One Shot Averaging) can, in general, be unboundedly worse than the centralized solution.

On the optimality of averaging.

Recall that common notions of asymptotic optimality, such as Best Regular and Local Minimax, depend only on the leading order error term (Vaart, 1998, Chapter 8). Hence, if the centralized estimator θ̂_N is optimal w.r.t. any of these criteria, Eq.(4) readily implies that so is the averaged estimate θ̄. A notable example, discussed in (Zhang et al., 2013b, Corollary 3) and in Liu and Ihler (2014), is when the loss function is the negative log likelihood of the generative model. The centralized solution, being the maximum-likelihood estimate of θ*, is optimal in several distinct senses. Theorem 1 thus implies that θ̄ is optimal as well, and that the leading error term in Eq.(4) cannot be improved.

Robustness.

An important question in distributed learning is how to handle potential outliers: should these be dealt with at the machine level, the aggregation level, or both? Recall that the robustness literature mostly considered the construction of estimators having minimal asymptotic variance, under the constraint of bounded influence of individual observations. For estimating the mean of a Gaussian distribution under possible contamination, Huber derived his famous loss function, and proved it to be optimal. As the Huber loss yields an M-estimator that satisfies the assumptions of Theorem 1, it thus follows that averaging machine-wise robust estimators is optimal in the same sense.

Hence, if the probability of a high proportion of outliers in any machine is negligible, and machine-failure is not a concern, it suffices to deal with outliers at the machine level alone. In other cases robust aggregation functions should be considered (Hsu and Sabato, 2013; Feng et al., 2014).

Asymptotic Linearity.

The proof of Theorem 1 relies on the asymptotic linearity of the estimator in some non-linear transformation of the samples. This is known as the asymptotic linearity property, and the corresponding transformation is the influence function. Asymptotic linearity holds for several other classes of estimators, including L-, R- and minimum distance estimators. Hence, the first-order equivalence of averaging to the centralized solution is rather general. It typically holds for asymptotically Gaussian estimators (Rieder, 2012, Chapters 1 and 6) and has also been observed in other contexts, such as that of particle filters (Achutegui et al., 2014).

Limiting Distribution

The asymptotic linearity of θ̄ in the influence function immediately yields the following limiting Gaussian distribution:

Corollary 2 (Asymptotic Normality).

Under the assumptions of Theorem 1, when n → ∞ with m fixed, √N (θ̄ − θ*) converges in distribution to a centered Gaussian with the sandwich covariance V^{-1} E[∇_θ f(Z, θ*) ∇_θ f(Z, θ*)'] V^{-1}.

Corollary 2 allows one to construct confidence intervals and test hypotheses on the unknown θ*. To this end, the asymptotic covariance matrix also needs to be estimated. Plugging in any consistent estimator of the covariance matrix preserves the asymptotic normality, via Slutsky's theorem.
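To illustrate Corollary 2, the following R sketch (ours) builds per-coordinate Wald confidence intervals from the averaged OLS estimator; since the machines are independent, the covariance of θ̄ is estimated by (1/m²) Σ_j of the machine-wise covariance estimates. The homoskedastic vcov() of lm() is an illustrative choice; any consistent covariance estimator, e.g. a sandwich estimator, could be plugged in, as noted above.

# Sketch: Wald confidence intervals from the averaged estimator (OLS example).
# vcov(lm(...)) is an illustrative plug-in covariance choice.
set.seed(2)
N <- 4e4; p <- 5; m <- 8; n <- N / m
theta.star <- 1:p
X <- matrix(rnorm(N * p), N, p)
y <- drop(X %*% theta.star + rnorm(N))

shard <- split(seq_len(N), sample(rep(1:m, each = n)))
fits  <- lapply(shard, function(idx) lm(y[idx] ~ X[idx, ] - 1))

theta.bar <- rowMeans(sapply(fits, coef))              # averaged estimate
cov.bar   <- Reduce(`+`, lapply(fits, vcov)) / m^2     # Cov of the average
ci <- cbind(lower    = theta.bar - 1.96 * sqrt(diag(cov.bar)),
            estimate = theta.bar,
            upper    = theta.bar + 1.96 * sqrt(diag(cov.bar)))
round(ci, 3)                                           # 95% Wald intervals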

3.2 Second Order Terms

As we show empirically in Section 5, relatively little accuracy is lost when parallelizing a linear model, but much can be lost when the model is non-linear. One reason is that the second order error term may be non-negligible. As discussed in Section 6, this term is also imperative when deciding how many machines to deploy, since the first-order approximation of the error does not depend on the number of machines m when the total number of samples N is fixed.

Before studying this second order term, let us provide a high level view. Intuitively, the first-order term captures estimation variance, which is reduced by averaging. The second order term also captures bias, which is not reduced by averaging. We would thus expect some second order suboptimality when parallelizing. Indeed, Theorem 2 below shows that the (second order) bias of the parallelized estimator is m times larger than that of the centralized one. The comparison between the second order MSE matrices of the parallelized and centralized estimators is more complicated. Theorem 3 provides an explicit expression, whose terms ultimately depend on the curvature of the risk at θ*.
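To make this intuition concrete, here is a back-of-the-envelope decomposition (ours, not the paper's exact expansion): each machine-wise estimate is written as a mean-zero fluctuation plus an O(1/n) bias term, with ε_j, b and Σ₁ serving as generic placeholders.

\hat{\theta}_j \approx \theta^* + \varepsilon_j + \frac{b}{n},
\qquad \mathbb{E}[\varepsilon_j] = 0, \quad
\operatorname{Var}(\varepsilon_j) = \frac{\Sigma_1}{n}, \qquad j = 1,\dots,m,

\bar{\theta} = \frac{1}{m}\sum_{j=1}^{m}\hat{\theta}_j
\approx \theta^* + \frac{1}{m}\sum_{j=1}^{m}\varepsilon_j + \frac{b}{n},
\qquad
\operatorname{Var}\Big(\frac{1}{m}\sum_{j=1}^{m}\varepsilon_j\Big) = \frac{\Sigma_1}{nm} = \frac{\Sigma_1}{N},
\qquad
\operatorname{bias}(\bar{\theta}) \approx \frac{b}{n} = m \cdot \frac{b}{N}.

Averaging thus drives the variance down to the centralized level Σ₁/N, while the bias remains at its machine-wise level b/n, i.e. m times the centralized bias b/N; this is exactly the behavior quantified by Theorems 2 and 3.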

3.2.1 Notation and Assumptions

To study the second order error of θ̄, we make suitable assumptions that ensure that the machine-wise M-estimator admits the following higher-order expansion,

θ̂ − θ* = a_{-1/2} + a_{-1} + δ,    (5)

where a_{-s/2} denotes an error term of order O_P(n^{-s/2}) and δ = O_P(n^{-3/2}). The following set of assumptions is sufficient for Eq.(5) to hold; see Rilstone et al. (1996).

Assumption Set 2.

There exists a neighborhood of θ* in which all of the following conditions hold:

  1. Local differentiability: the derivatives of the loss f(z, θ), up to third order, exist a.s. and have bounded moments.

  2. Bounded empirical Hessian: the inverse of the empirical Hessian ∇²R̂(θ) is bounded in probability.

  3. Lipschitz gradients: the highest-order derivative of the loss is Lipschitz in θ, with a Lipschitz coefficient of finite expectation.

For future use, and following the notation in Rilstone et al. (1996), we define the following column vector and matrices:

(6)

3.2.2 Second Order Bias

Let the second order bias of an estimator with respect to θ* be defined as follows:

(7)

The following theorem, proven in Appendix B, shows that under our assumptions averaging over m machines is (up to second order) m times more biased than the centralized solution.

Theorem 2 (Second Order Bias).

Under Assumption Set 2, the second order bias of the parallelized estimator θ̄ is m times that of the centralized estimator θ̂_N, so that

(8)
Remark 2.

The second order bias can be reduced at the cost of a larger first-order error, i.e., trading bias for variance. In general, this should be done with caution, since in extreme cases debiasing may inflate variance infinitely (Doss and Sethuraman, 1989). Approaches to reduce the second order bias include that of Kim (2006) who modifies the machine-wise loss function, Liu and Ihler (2014) who propose a different aggregation of the machine-wise estimates, and Zhang et al. (2013b), whose SAVGM algorithm estimates the machine-wise bias via bootstrap. A different approach is to trade bias for communication. Recent works that reduce the bias by allowing communication between the machines include the DDPCA algorithm (Meng et al., 2012), which transfers parts of the inverse Hessian between machines and DANE (Shamir et al., 2013), which transfers gradients.
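To illustrate the flavor of such corrections, here is a minimal R sketch of a subsampled-averaging correction in the spirit of the SAVGM algorithm of Zhang et al. (2013b); the subsampling ratio r, the default least-squares fit, and the combination rule below are our paraphrase of that idea, not a verbatim implementation.

# Sketch of a subsampled-averaging bias correction in the spirit of SAVGM
# (Zhang et al., 2013b); the details below are our paraphrase.
# Each machine returns two estimates: one on its full shard of size n and one
# on a random subsample of size r*n. Combining the two averages as
# (theta.bar - r * theta.sub) / (1 - r) cancels an O(1/n) bias term.
savgm <- function(y, X, m, r = 0.5, fit = function(y, X) coef(lm(y ~ X - 1))) {
  N <- length(y); n <- N %/% m                 # drop leftovers if m does not divide N
  shard <- split(sample(seq_len(n * m)), rep(1:m, each = n))
  est <- sapply(shard, function(idx) {
    sub <- sample(idx, ceiling(r * length(idx)))
    cbind(full = fit(y[idx], X[idx, , drop = FALSE]),
          sub  = fit(y[sub], X[sub, , drop = FALSE]))
  }, simplify = "array")                       # p x 2 x m array of estimates
  theta.bar <- rowMeans(est[, "full", ])
  theta.sub <- rowMeans(est[, "sub", ])
  (theta.bar - r * theta.sub) / (1 - r)        # bias-corrected aggregate
}
# Usage: theta.tilde <- savgm(y, X, m = 20, r = 0.5)

The correction removes the O(1/n) bias at the price of some variance inflation, so the choice of r trades bias reduction against variance.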

3.2.3 Second Order MSE

Following Rilstone et al. (1996), for any estimator based on n samples, we define its second order MSE matrix via

(9)

It follows from Rilstone et al. (1996, Proposition 3.4) that under Assumption Set 2,

(10)

The following theorem compares the second order MSE matrices of the parallelized and centralized estimators.

Theorem 3 (Second Order MSE).

Under Assumption Set 2, the second order MSE matrix of the parallelized estimator θ̄ is given by

(11)

Furthermore, the excess second order error due to parallelization is given by

(12)

In general, the second order MSE matrix of Eq.(10) need not be positive definite (PD) (Rilstone et al., 1996). Note, however, that since two of the matrices appearing in Eqs.(11) and (12) are PD by definition, a simple condition on the remaining term ensures that both the parallelized second order MSE matrix and the excess error are PD. If this holds, then parallelization indeed deteriorates accuracy, at least up to second order.

Remark 3.

Even if the second order MSE matrix is PD, due to higher order terms, parallelization may actually be more accurate than the centralized solution. An example is ridge regression with a fixed penalty and null coefficients (i.e., θ* = 0). The regularization term, being fixed, acts more aggressively with n observations than with N. The machine-wise estimates are thus more biased towards zero than the centralized one. As the bias acts in the correct direction, θ̄ is more accurate than θ̂_N. Two remarks are, however, in order: (a) this phenomenon is restricted to particular parameter values and shrinkage estimators; it does not occur uniformly over the parameter space. (b) In practice, the precise regularization penalty may be adapted to account for the parallelization; see for example Zhang et al. (2013a).

3.3 Examples

We now apply our results to two popular learning tasks: ordinary least squares (OLS) and ridge regression, both assuming a generative linear model. We study these two cases not only due to their popularity, but also as they are analytically tractable. As we show below, parallelizing the OLS task incurs no excess bias, but does exhibit excess (second order) MSE. The ridge problem, in contrast, has both excess bias and excess (second order) MSE.

3.3.1 OLS

Consider the standard generative linear model y = x'θ* + ε, where the vector of explanatory variables x has an invertible covariance matrix Σ, and the noise ε is independent of x with mean zero and variance σ². The loss is the squared loss, f((x, y), θ) = (y − x'θ)², whose risk minimizer is the generative parameter, θ*. The following proposition, proved in Appendix D, provides explicit expressions for the second order MSE matrix.

Proposition 1 (OLS Error Moments).

For the OLS problem, under the above generative linear model,

Inserting these expressions into Theorem 2 yields that the second order bias vanishes, both for the individual machine-wise estimators and for their average. Combining Proposition 1 with Theorem 3 yields the following expressions for the parallelized second order MSE and the excess error,

(13)

In OLS, parallelization thus incurs a second order accuracy loss, since the excess error matrix is PD.

3.3.2 Ridge Regression

Next, we analyze ridge regression under the same generative model, but now with a fixed ridge penalty added to the squared loss. The risk minimizer now equals a shrunken version of the generative parameter θ*, rather than θ* itself. Adding a simplifying assumption on the design covariance, and introducing shorthand notation for the recurring quantities, we obtain the following result.

Proposition 2 (Ridge Error Moments).

For the ridge regression problem, under the above conditions, the matrices that control the second order bias and MSE, defined in Section 3.2.1, are given by

Corollary 3 (Ridge Second Order Bias).

Combining Proposition 2 with Theorem 2, under a linear generative model, the second order bias of the parallelized ridge regression estimate is

Corollary 4 (Ridge Second Order MSE).

Combining Proposition 2 with Theorem 3, under a linear generative model, the parallelized second order MSE matrix and excess MSE are given by

(14)

and

As in the OLS case, since the first of these matrices is an outer product and the second is a scaled identity matrix, it follows that both the parallelized second order MSE matrix and the excess MSE are PD. Despite this result, as discussed in Remark 3, it is still possible that the overall excess MSE is negative definite due to higher order error terms, implying that parallelized ridge regression can be more accurate than the centralized estimator. This has been confirmed in simulations (not included).

4 High-Dimensional Approximation

As reviewed in Section 1, most existing theory on parallelization assumes a fixed dimension p, independent of the sample size n. This is implied by the assumption that the empirical risk gradients have uniformly bounded moments, independent of p (e.g. Zhang et al., 2013b, Assumption 3). However, it is common practice to enrich a model as more data is made available, to the extent that the number of unknown parameters is comparable to the number of samples. If p is comparable to n, and both are large, then the approximations of Section 3 may underestimate the parallelization's excess error. To address this setting, we now perform a high-dimensional analysis where n, p → ∞ and p/n → κ ∈ (0, 1).

To the best of our knowledge, there is no general theory for the behavior of M-estimators in this regime. To gain insight into the statistical properties of parallelization in this high-dimensional setting, we restrict our focus to generative linear models, for which the appropriate theory has been developed only recently. Building on the works of Donoho and Montanari (2013) and El Karoui et al. (2013), we thus consider a random variable Z = (X, Y), consisting of a vector of predictor variables X ∈ ℝ^p and a scalar response variable Y, which satisfy the following assumptions:

Assumption Set 3.
  C1. The observed data are i.i.d. samples of the random variable Z = (X, Y), where the predictor vector X is Gaussian with an invertible covariance matrix Σ.

  C2. Linear generative model: Y = X'θ* + ε.

  C3. The noise random variable ε has zero mean, finite second moment, and is independent of X.

  C4. The loss is smooth and strongly convex.

Unlike the fixed-p case, in the high-dimensional regime where p and n grow together, each machine-wise estimate is inconsistent. As shown by El Karoui et al. (2013) and Donoho and Montanari (2013), when Assumption Set 3 holds, then as n, p → ∞ with p/n → κ,

‖θ̂_j − θ*‖ → r(κ),    (15)

where ‖·‖ is the Euclidean norm and r(κ) is a deterministic quantity that depends on κ, on the loss function, and on the distribution of the noise ε.

Using the above result, we now show that, in contrast to the fixed-p setting, averaging is not even first-order equivalent to the centralized solution. The following lemma, proven in Appendix G.1, quantifies this accuracy loss, showing that it is typically moderate.

Lemma 1.

Under Assumption Set 3, as n, p → ∞ with p/n → κ ∈ (0, 1),

(16)

where

(17)

and

(18)

where .

Remark 4.

For simplicity of exposition, we followed the assumptions of Bean et al. (2013, Result 1). However, many of these can be relaxed, as discussed by El Karoui (2013). In particular, the design X need not be Gaussian, provided that it is asymptotically orthogonal and exponentially concentrating; the loss need not be strongly convex nor infinitely differentiable; and an interplay is possible between assumptions on the tail mass of the loss and of the noise. Note, however, that for our perturbation analysis of the behavior of r(κ) as κ → 0 to hold, we assume the loss is at least six times differentiable with a bounded sixth derivative.

Remark 5.

There are cases where r(κ) can be evaluated exactly, without resorting to approximations. One such case is the least squares loss with arbitrary noise satisfying C3, in which Wishart theory gives an exact expression (El Karoui et al., 2013); a second order Taylor approximation of this exact result is in accord with our Eqs.(16) and (17). A second case where an exact formula is available is the absolute loss with Gaussian errors: a second order Taylor expansion of the closed form solution derived in (El Karoui et al., 2013, Page 3) is again consistent with Eq.(16).

Typical Accuracy Losses

In classical asymptotics, where κ → 0, Eq.(16) is consistent with the results of Section 3 in that splitting the data has no (first-order) cost. In practical high-dimensional scenarios, where the practitioner applies the “no less than five observations per parameter” rule of thumb (Huber, 1973), the resulting value of κ is at most 0.2. The accuracy loss of splitting the data is thus small provided that the ratio appearing in Eq.(16) is not large. As shown in Table 1, for several loss functions with either Gaussian or Laplace errors, this ratio is approximately one.

Loss            Gaussian    Laplace
Squared
Pseudo Huber
Absolute Loss
Table 1: The ratio appearing in Eq.(16) for different loss functions and noise distributions (Gaussian and Laplace). From Eq.(16), small values of this ratio imply a small accuracy loss when parallelizing.
Limiting Distribution

Similar to the fixed-p regime, using Eq.(15) we can derive the following limiting distribution of θ̄ in the high-dimensional regime. It is immediate from the results of Bean et al. (2013, p.1 in SI) and the fact that θ̄ has m times less variance than a single machine-wise estimate.

Corollary 5 (Asymptotic Normality).

Under the assumptions of Lemma 1, for a fixed contrast vector v, as n, p → ∞ with p/n → κ ∈ (0, 1), the contrast v'(θ̄ − θ*), suitably normalized, is asymptotically Gaussian.

5 Simulations

We perform several simulations to validate our results and assess their stability in finite samples. For reproducibility, the R simulation code is available at https://github.com/johnros/ParalSimulate.

Figure 1: The error ratio between the parallelized and centralized estimates as a function of the machine-wise sample size n in the fixed-p regime. The center point is the median over replications and the bars represent the median absolute deviation. Color encodes the number of machines m. The learning tasks in the four panels are: (a) ordinary least squares; (b) ridge regression; (c) non-linear least squares; (d) logistic regression.

We start with the fixed-p, large-n regime. Figure 1 shows the empirical median and median absolute deviation of the individual error ratios as a function of the machine-wise sample size n, with the total sample size N = nm growing as well. As seen from this figure, and in accord with Theorem 1, for large n the averaged estimator θ̄ is asymptotically equivalent to the centralized θ̂_N, and the error ratio tends to one. We also see that for small to moderate n, parallelization may incur a non-negligible excess error, in particular for non-linear models.
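A stripped-down version of the Figure 1 experiment for OLS can be written in a few lines of R (our sketch; the full simulation code is in the repository referenced above). The dimensions and sample sizes below are arbitrary.

# Sketch of a Figure 1-style experiment for OLS: the ratio of the
# parallelized to the centralized estimation error approaches 1
# as the machine-wise sample size n grows (fixed p and m).
set.seed(3)
p <- 50; m <- 10
err.ratio <- sapply(c(100, 300, 1000, 3000), function(n) {
  N <- n * m
  theta.star <- rnorm(p)
  X <- matrix(rnorm(N * p), N, p)
  y <- drop(X %*% theta.star + rnorm(N))
  shard <- rep(1:m, each = n)
  theta.bar <- rowMeans(sapply(1:m, function(j) {
    idx <- which(shard == j)
    qr.solve(X[idx, ], y[idx])                 # machine-wise OLS
  }))
  theta.N <- qr.solve(X, y)                    # centralized OLS
  sqrt(sum((theta.bar - theta.star)^2)) / sqrt(sum((theta.N - theta.star)^2))
})
err.ratio                                      # decreases toward 1 as n grows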

Figure 2 presents the empirical bias and MSE of θ̄ in OLS, as a function of the number of machines m with the total sample size N fixed, and compares these to their theoretical approximations from Section 3.3.1. In accord with Proposition 1, the parallelized OLS estimate shows no excess bias. In this OLS case, a high-dimensional approximation of the MSE is identical to the fixed-p one in panel (b), so the plot is omitted. We thus conclude that both the fixed-p and the high-dimensional approximations of the MSE are quite accurate for small m (i.e., large n), but underestimate the error as m grows and n shrinks.

Figure 2: Bias and MSE in OLS as a function of the number of machines m, with a fixed total number of samples N, for different dimensions p, averaged over replications. Panel (a) shows, in dotted lines, the mean of the empirical bias of an arbitrary coordinate of the parallelized estimator θ̄. The solid line is the theoretical expression for the second order bias, which, as discussed in Section 3.3.1, is precisely zero. Panel (b) shows, in dotted lines, the empirical mean squared error of the parallelized estimator as a function of m. The solid lines are the theoretical approximation using the second order fixed-p expansion, with the excess term from Eq.(13).
Figure 3: Bias and MSE in ridge regression as a function of the number of machines m, with a fixed total number of samples N, for different dimensions p, averaged over replications. Panel (a) shows, in dotted lines, the mean of the empirical bias of an arbitrary coordinate of the parallelized estimator. The solid line is the theoretical expression for the second order bias from Section 3.3.2. Panel (b) shows, in dotted lines, the empirical MSE of the parallelized estimator as a function of m. The solid lines are the theoretical approximation using a second order fixed-p expansion of the error, with the excess term given in Eq.(14).

Figure 3 is similar to Figure 2, but for ridge regression. We see that, unlike the OLS problem, the ridge problem does have a parallelization bias, as predicted by our analysis in Section 3.3.2. While our fixed-p MSE approximation is accurate for small m (i.e., large n), for larger m the empirical error is smaller than that predicted by our second order analysis. This suggests that higher order error terms in the MSE matrix are negative definite (see also Remark 3).

Figure 4: MSE ratio between the parallelized and centralized estimates in the high-dimensional regime (with p/n fixed) as a function of the machine-wise sample size n, averaged over 500 replications. Color encodes the number of machines m. The four panels depict the same learning tasks as in Figure 1.

Next, we consider the high-dimensional regime. Figure 4 shows the MSE ratio as a function of the machine-wise sample size n, while holding the number of machines m and the ratio p/n fixed. In contrast to the fixed-p regime, here there is a first-order accuracy loss, and even for large n the MSE ratio does not converge to one. In the OLS case, where our high-dimensional approximations are applicable, they are indeed accurate over a wide range of values of n and m. As already observed in the fixed-p regime, non-linear models (panels c and d) incur a considerable parallelization excess error.
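This first-order loss is easy to reproduce. In the following R sketch (ours), the per-machine ratio p/n is held at κ = 0.2 while the number of machines varies; the MSE ratio stays bounded away from one and increases with m, in contrast to the fixed-p behavior.

# Sketch: with kappa = p/n held fixed, the parallelized-to-centralized
# MSE ratio of OLS does not approach 1, and increases with m.
set.seed(4)
kappa <- 0.2; n <- 2000; p <- kappa * n
ratio.for <- function(m) {
  N <- n * m
  theta.star <- rnorm(p)
  X <- matrix(rnorm(N * p), N, p)
  y <- drop(X %*% theta.star + rnorm(N))
  shard <- rep(1:m, each = n)
  theta.bar <- rowMeans(sapply(1:m, function(j) {
    idx <- which(shard == j); qr.solve(X[idx, ], y[idx])
  }))
  theta.N <- qr.solve(X, y)
  sum((theta.bar - theta.star)^2) / sum((theta.N - theta.star)^2)
}
sapply(c(2, 4, 8), ratio.for)   # ratios stay above 1 and grow with m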

6 Practical Considerations

Parallelization is not necessarily the preferred approach to deal with massive datasets. In principle, an easy, though potentially not sufficiently accurate, solution is to discard observations by randomly subsampling the data. Parallelization should thus be considered when the accuracy attainable by subsampling is not satisfactory. An important question is then: over how many machines should the practitioner distribute the data? When tackling this question, we distinguish between two scaling regimes: fixed n or fixed N. Fixed n captures the single-machine storage constraint: the total available data is virtually infinite, and using more machines allows processing of more data, and hence better accuracy, at an obvious financial cost. Fixed N captures either sampling or computational constraints: here, the total sample size is fixed, and processing it on a single machine might be too slow. Splitting the data thus reduces run-time but also decreases accuracy. In other words, by parallelizing, we trade accuracy for speed. Interestingly, when the number of samples N is fixed, by using our approximations and varying m, we are able to trace the accuracy-complexity tradeoff facing the practitioner. An informed choice of m is thus the choice of either a desirable run-time or a desired error level on this curve.

We now formulate the target functions for choosing the number of machines in these two regimes. For fixed n, wishing to minimize costs, we ask what is the minimal number of machines m that attains a desired accuracy ε:

(19)

For fixed N, wishing to minimize runtime, and in the spirit of Shalev-Shwartz and Srebro (2008), we ask what is the maximal number of machines such that runtime is minimized while a desired level of accuracy is maintained. Choosing the number of machines in the fixed-N scenario thus reduces to solving

(20)

Next, let us study these two optimization problems, Eqs.(19) and (20), when the accuracy measure is the MSE of θ̄. This is challenging, or even infeasible, since in general we do not have explicit expressions for this quantity. Moreover, in the fixed-p regime, approximating the MSE by the asymptotic leading error term yields a quantity that does not depend on m when N is fixed. As we show below, meaningful and interesting solutions to these optimization problems arise when we approximate the MSE by the second order expression of Eq.(11) in the fixed-p regime. In the high-dimensional regime, in contrast, the optimization problems (19) and (20) are well posed already when we approximate the MSE by the first order term of Eq.(G.2).

We now present the optimization problems corresponding to the fixed-p approximation in each scaling scenario:

Fixed-n:

which stems from Eq.(19) and Eq.(11).

Fixed-N:

which stems from Eq.(20) and Eq.(11) with n = N/m.

Let us illustrate these formulas in the OLS example from Section 3.3.1; the required quantities for OLS are collected in Appendix D. For example, solving the fixed-N problem yields the maximal number of machines that keeps the per-coordinate MSE under a prescribed level ε for given N, p and σ². Alternatively, assuming an abundance of data and a memory limit on the machine-wise sample size n, solving the fixed-n problem yields the number of machines needed to attain the desired error level.
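As a concrete, simplified illustration of the fixed-N problem, the following R sketch replaces the paper's second order approximation of Eq.(11) with the exact Gaussian-design identity E[(X'X)^{-1}] = Σ^{-1}/(n − p − 1); with Σ = I and n = N/m this gives MSE(θ̄) = σ²p/(N − m(p + 1)). It then searches for the largest m that keeps this MSE below a prescribed ε. The numerical values are arbitrary.

# Sketch of the fixed-N choice of m for OLS with Gaussian design and Sigma = I,
# using the exact Wishart identity E[(X'X)^{-1}] = I / (n - p - 1) instead of
# the paper's general second order approximation (Eq. 11).
# (We ignore the requirement that N/m be an integer.)
choose.m <- function(N, p, sigma2, eps) {
  mse.bar <- function(m) sigma2 * p / (N - m * (p + 1))   # MSE of theta.bar
  candidates <- 1:floor(N / (p + 2))                      # keep n > p + 1
  feasible <- which(sapply(candidates, mse.bar) <= eps)
  if (length(feasible) == 0) return(NA)                   # even m = 1 exceeds eps
  max(feasible)                                           # largest m with MSE <= eps
}
choose.m(N = 1e6, p = 100, sigma2 = 1, eps = 2e-4)        # 4950 under these numbers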

Remark 6.

In some cases, the practitioner may wish to control the parallelization error relative to the centralized solution, and not in absolute terms as analyzed above. Namely, the restriction is now that the MSE of θ̄ exceed that of θ̂_N by no more than a prescribed factor. The scenarios (fixed n or fixed N) and the approximations previously discussed apply here as well; in our OLS example, they yield, respectively, the number of machines needed so that θ̄ errs no more than a prescribed percentage above the centralized solution, and the maximal number of machines that can be deployed while maintaining the same excess error allowance.

7 Discussion

In this work we studied the error of parallelized M-estimators, where N observations are distributed uniformly at random over m machines. Each machine then learns a p-dimensional model from its own n = N/m observations, and the machine-wise results are averaged into a global estimate θ̄. We derived several different approximations of the estimation error in θ̄, with different quantitative and qualitative insights.

Insights

When each machine has many observations per parameter, not much accuracy is lost by splitting the data. This stands in contrast to other works that demonstrate how, under different assumptions, parallelization combined with averaging may incur a large error (Liu and Ihler, 2014), or even an unbounded one (Shamir et al., 2013). Our analysis can thus be viewed as providing sufficient conditions for parallelization to be a suitable approach to reducing the overall run-time. A second insight is that if the model is highly non-linear, then the excess parallelization error may be considerable.

In contrast to the classical fixed-p regime, our high-dimensional analysis, currently confined to generative linear models, showed that splitting the data when there are only a few observations per parameter always takes an accuracy toll. The degradation in accuracy due to splitting can still be quantified, even though the machine-wise estimates converge to non-degenerate random limits.

Future Research

At the basis of our work is an attempt to adhere to the real-life software and hardware constraints of parallelized learning. The assumption of a uniform and random distribution of samples over machines is realistic for some applications, and certainly facilitates the mathematical analysis; it may, however, be overly restrictive for other applications. An avenue for future research is thus the relaxation of this assumption, allowing for some systematic difference between machines. We also aim to analyze other aggregation schemes, particularly ones that employ more than the mere machine-wise point estimates and that apply to non-convex parameter spaces. An example is the minimum Kullback-Leibler divergence aggregation proposed by Liu and Ihler (2014). This may extend the applicability of our results, for example, to image, sound, and graph data.

Acknowledgments

We thank Derek Bean, Kyoo il Kim, Yaakov Ritov, Saharon Rosset, Ohad Shamir and Yuchen Zhang for fruitful discussions. This research was partly supported by a grant from the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI).

Appendix A Proof of Theorem 1

Under Assumption Set 1, classical statistical theory guarantees that, upon optimizing the empirical risk (2), the resulting machine-wise estimators θ̂_j converge in probability to θ* at rate n^{-1/2}. Moreover, the leading error term is linear in the influence functions (Vaart, 1998, Theorem 5.23):

θ̂_j − θ* = −(1/n) Σ_{i ∈ S_j} V^{-1} ∇_θ f(z_i, θ*) + o_P(n^{-1/2}),    (A.1)

where S_j denotes the indexes of the observations assigned to machine j. Taking the average of the machine-wise estimators over a fixed number of machines m, and applying Eq.(A.1), yields

θ̄ − θ* = −(1/N) Σ_{i=1}^{N} V^{-1} ∇_θ f(z_i, θ*) + o_P(n^{-1/2}).    (A.2)

Similarly, applying Eq.(A.1) to the centralized solution, based on all N = nm samples,

θ̂_N − θ* = −(1/N) Σ_{i=1}^{N} V^{-1} ∇_θ f(z_i, θ*) + o_P(N^{-1/2}).

Since m is fixed, o_P(n^{-1/2}) = o_P(N^{-1/2}). Eq.(4) now follows.

Appendix B Proof of Theorem 2

Under Assumption Set 2, by Proposition 3.2 in Rilstone et al. (1996), the machine-wise estimator admits the expansion of Eq.(5). We can thus decompose