1 Introduction
Explosive growth in the size of modern datasets has fueled interest in distributed statistical learning; see, for example, Boyd et al. (2011); Dekel et al. (2012); Duchi, Agarwal and Wainwright (2012); Zhang, Duchi and Wainwright (2013) and the references therein. The problem arises, for example, when a dataset is too large to fit on a single machine and must be distributed across multiple machines. The main bottleneck in the distributed setting is usually communication between machines/processors, so the overarching goal of algorithm design is to minimize communication costs.
In distributed statistical learning, the simplest and most popular approach is averaging: each machine forms a local estimator from the portion of the data stored locally, and a "master" averages the local estimators to produce an aggregate estimator. Averaging was first studied by McDonald et al. (2009) for multinomial regression. They derive non-asymptotic bounds on the estimation error which show that averaging reduces the variance of the local estimators, but has no effect on the bias (relative to the centralized solution). In follow-up work, Zinkevich et al. (2010) studied a variant of averaging in which each machine computes a local estimator by stochastic gradient descent (SGD) on a random subset of the dataset. They show, among other things, that their estimator converges to the centralized estimator.
More recently, Zhang, Duchi and Wainwright (2013) studied averaged empirical risk minimization (ERM). They show that the mean squared error (MSE) of the averaged ERM decays like $O(\frac{1}{N} + \frac{m^2}{N^2})$, where $m$ is the number of machines and $N$ is the total number of samples. Thus, so long as $m \lesssim \sqrt{N}$, the averaged ERM matches the convergence rate of the centralized ERM. Even more recently, Rosenblatt and Nadler (2014) studied the optimality of averaged ERM in two asymptotic settings: $n \to \infty$ with $m$ fixed, and $n, m \to \infty$, where $n = N/m$ is the number of samples per machine. They show that in the $n \to \infty$, $m$ fixed setting, the averaged ERM is first-order equivalent to the centralized ERM. However, when $m$ also grows, the averaged ERM is suboptimal (versus the centralized ERM).
We develop an approach to distributed statistical learning in the high-dimensional setting. Since the local problems are high-dimensional, regularization is essential. At a high level, the key idea is to average local debiased regularized M-estimators. We show that our averaged estimator converges at the same rate as the centralized regularized M-estimator.
2 Background on the lasso and the debiased lasso
To keep things simple, we focus on sparse linear regression. Consider the sparse linear model
$$y = X\theta^* + \epsilon,$$
where the rows of $X \in \mathbb{R}^{n \times p}$ are predictors, and the components of $y \in \mathbb{R}^n$ are the responses. To keep things simple, we assume:

(A1) the predictors (the rows of $X$) are independent subgaussian random vectors whose covariance $\Sigma$ has smallest eigenvalue bounded away from zero;

(A2) the regression coefficients are sparse, i.e., all but $s$ components of $\theta^*$ are zero;

(A3) the components of the noise $\epsilon$ are independent, mean-zero subgaussian random variables.
Given the predictors and responses, the lasso estimates $\theta^*$ by
$$\hat\theta_\lambda \in \arg\min_{\theta \in \mathbb{R}^p}\ \frac{1}{2n}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1.$$
There is a well-developed theory of the lasso that says, under suitable assumptions on $X$, the lasso estimator is nearly as close to $\theta^*$ as the oracle estimator that performs least squares on the true support (e.g., see Hastie, Tibshirani and Wainwright (2015), Chapter 11 for an overview). More precisely, under some conditions on $X$, the MSE of the lasso estimator is roughly $\frac{s \log p}{n}$. Since the MSE of the oracle estimator is (roughly) $\frac{s}{n}$, the lasso estimator is almost as good as the oracle estimator.
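To make the estimator concrete, the following is a minimal numpy sketch of the lasso solved by proximal gradient descent (ISTA) on a synthetic sparse instance. The solver, the toy problem sizes, and the step size choice are our own illustration, not part of the paper.

```python
import numpy as np

def soft_threshold(z, t):
    """Entrywise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/2n)||y - X theta||_2^2 + lam*||theta||_1 by ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the smooth part
    theta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y) / n
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

# Synthetic sparse instance: s nonzero coefficients out of p.
rng = np.random.default_rng(0)
n, p, s = 200, 50, 5
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:s] = 1.0
y = X @ theta_star + 0.5 * rng.standard_normal(n)

theta_hat = lasso_ista(X, y, lam=0.5 * np.sqrt(np.log(p) / n))
```

The regularization level follows the $\sqrt{\log p / n}$ scaling discussed later; the constant in front is an arbitrary choice for this toy instance.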
However, the lasso estimator is also biased (we refer to Section 2.2 in Javanmard and Montanari (2013a) for a more formal discussion of the bias of the lasso estimator). Since averaging only reduces variance, not bias, we gain (almost) nothing by averaging the biased lasso estimators. That is, it is possible to show that if we naively average local lasso estimators, the MSE of the averaged estimator is of the same order as that of the local estimators. The key to overcoming the bias of the averaged lasso estimator is to "debias" the local lasso estimators before averaging.
The debiased lasso estimator of Javanmard and Montanari (2013a) is
(2.1) $\hat\theta^d = \hat\theta_\lambda + \frac{1}{n} M X^T (y - X\hat\theta_\lambda),$
where $\hat\theta_\lambda$ is the lasso estimator and $M$ is an approximate inverse to $\hat\Sigma = \frac{1}{n} X^T X$. Intuitively, the debiased lasso estimator trades bias for variance. The tradeoff is obvious when $\hat\Sigma$ is nonsingular: setting $M = \hat\Sigma^{-1}$ gives the ordinary least squares (OLS) estimator $\hat\theta^d = (X^T X)^{-1} X^T y.$
Another way to interpret the debiased lasso estimator is as a corrected estimator that compensates for the bias incurred by shrinkage. By the optimality conditions of the lasso, the correction term $\frac{1}{n} X^T (y - X\hat\theta_\lambda)$ is proportional to a subgradient of the $\ell_1$ norm at $\hat\theta_\lambda$. By adding a term proportional to the subgradient of the regularizer, the debiased lasso estimator compensates for the bias incurred by regularization. The debiased lasso estimator has previously been used to perform inference on the regression coefficients in high-dimensional regression models. We refer to the papers by Javanmard and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014); Belloni, Chernozhukov and Hansen (2011) for details.
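To illustrate (2.1) and the remark about OLS, here is a small numpy sketch (our own illustration; the ISTA solver and problem sizes are assumptions, not from the paper). Since $p < n$ here, the sample covariance is nonsingular, and debiasing with its exact inverse recovers the OLS estimator, as noted above.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Lasso via proximal gradient descent (ISTA)."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        theta = soft_threshold(theta - X.T @ (X @ theta - y) / (n * L), lam / L)
    return theta

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
theta_star = np.zeros(p)
theta_star[:3] = 1.0
y = X @ theta_star + 0.5 * rng.standard_normal(n)

theta_lasso = lasso_ista(X, y, lam=0.1)
Sigma_hat = X.T @ X / n

# Debiasing step (2.1): add (1/n) M X^T (y - X theta_lasso) to the lasso estimate.
M = np.linalg.inv(Sigma_hat)  # Sigma_hat is nonsingular here since p < n
theta_d = theta_lasso + M @ X.T @ (y - X @ theta_lasso) / n

# Sanity check from the text: with M = Sigma_hat^{-1}, debiasing recovers OLS.
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
```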
The choice of $M$ in the correction term is crucial to the performance of the debiased estimator. Javanmard and Montanari (2013a) suggest forming $M$ row by row: the $i$th row of $M$ is the optimum of
(2.2) $\min_{m \in \mathbb{R}^p}\ m^T \hat\Sigma m$
subject to $\|\hat\Sigma m - e_i\|_\infty \le \mu.$
The parameter $\mu$ should be large enough to keep the problem feasible, but as small as possible to keep the bias (of the debiased lasso estimator) small. As we shall see, when the rows of $X$ are subgaussian, setting $\mu \sim \sqrt{\frac{\log p}{n}}$ is usually large enough to keep (2.2) feasible.
Definition 2.1 (Generalized coherence).
Given $M \in \mathbb{R}^{p \times p}$, let $\hat\Sigma = \frac{1}{n} X^T X$. The generalized coherence between $X$ and $M$ is
$$\mu_*(X; M) = \max_{i,j}\ \bigl|(M\hat\Sigma - I)_{ij}\bigr|.$$
Lemma 2.2 (Javanmard and Montanari (2013a)).
Under (A1), when the event
occurs with probability at least
for some constant, where $\kappa$ is the condition number of $\Sigma$. As we shall see, the bias of the debiased lasso estimate is of higher order than its variance under suitable conditions on $\hat\Sigma$. In particular, we require $\hat\Sigma$ to satisfy the restricted eigenvalue (RE) condition.
Definition 2.3 (RE condition).
For any subset $S \subseteq \{1, \dots, p\}$, let
$$\mathcal{C}(S) = \{\Delta \in \mathbb{R}^p : \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}.$$
We say $\hat\Sigma$ satisfies the RE condition on the cone $\mathcal{C}(S)$ when
$$\Delta^T \hat\Sigma \Delta \ge \alpha \|\Delta\|_2^2$$
for some $\alpha > 0$ and any $\Delta \in \mathcal{C}(S)$.
The RE condition requires $\hat\Sigma$ to be positive definite on the cone. When the rows of $X$ are i.i.d. Gaussian random vectors, Raskutti, Wainwright and Yu (2010) show there are constants $c_1, c_2$ such that
$$\frac{\|X\Delta\|_2^2}{n} \ge c_1 \|\Sigma^{1/2}\Delta\|_2^2 - c_2\,\frac{\log p}{n}\,\|\Delta\|_1^2 \quad\text{for all } \Delta \in \mathbb{R}^p$$
with probability at least $1 - c' e^{-c n}$. Their result implies the RE condition holds on the cone (for any $S$ with at most $s$ elements) as long as $n \gtrsim s \log p$, even when there are dependencies among the predictors. Their result was extended to subgaussian designs by Rudelson and Zhou (2013), also allowing for dependencies among the covariates. We summarize their result in a lemma.
Lemma 2.4.
Under (A1), when and , where , the event
occurs with probability at least
Proof.
The lemma is a consequence of Rudelson and Zhou (2013), Theorem 6. In their notation, we set and bound and by and ∎
When the RE condition holds, the lasso and debiased lasso estimators are consistent for a suitable choice of the regularization parameter $\lambda$. The parameter should be large enough to dominate the "empirical process" part of the problem, $\|\frac{1}{n} X^T \epsilon\|_\infty$, but as small as possible to reduce the bias incurred by regularization. As we shall see, setting $\lambda \sim \sigma\sqrt{\frac{\log p}{n}}$ is a good choice.
Lemma 2.5.
Under (A3),
with probability at least for any (nonrandom) .
When $\hat\Sigma$ satisfies the RE condition and $\lambda$ is large enough, the lasso and debiased lasso estimators are consistent.
Lemma 2.6 (Negahban et al. (2012)).
Under (A2) and (A3), suppose satisfies the RE condition on with constant and ,
When the lasso estimator is consistent, the debiased lasso estimator is also consistent. Further, it is possible to show that the bias of the debiased estimator is of higher order than its variance. Similar results by Javanmard and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014); Belloni, Chernozhukov and Hansen (2011) are the key step in showing the asymptotic normality of the components of the debiased lasso estimator. The result we state is essentially Javanmard and Montanari (2013a), Theorem 2.3.
Lemma 2.7.
Under the conditions of Lemma 2.6, when $M$ has generalized coherence $\mu_*(X; M) \le \mu$, the debiased lasso estimator has the form
$$\hat\theta^d = \theta^* + \frac{1}{n} M X^T \epsilon + \Delta,$$
where $\Delta = (I - M\hat\Sigma)(\hat\theta_\lambda - \theta^*)$.
Lemma 2.7, together with Lemmas 2.5 and 2.2, shows that the bias of the debiased lasso estimator is of higher order than its variance. In particular, setting $\lambda$ and $\mu$ according to Lemmas 2.5 and 2.2 gives a bias term that is $O\left(\frac{s \log p}{n}\right)$. By comparison, the variance term is the maximum of $p$ subgaussian random variables with mean zero and variances of order $\frac{1}{n}$, which is $O_P\left(\sqrt{\frac{\log p}{n}}\right)$. Thus the bias term is of higher order than the variance term as long as $s \lesssim \sqrt{\frac{n}{\log p}}$.
Corollary 2.8.
Under (A2), (A3), and the conditions of Lemma 2.6, when $M$ has generalized coherence $\mu_*(X; M) \le \mu$ and we set
3 Averaging debiased lassos
Recall the problem setup: we are given $N$ samples of the form $(x_i, y_i)$ distributed across $m$ machines.
The $k$th machine has local predictors $X_k$ and responses $y_k$. To keep things simple, we assume the data is evenly distributed, i.e., each machine has $n = N/m$ samples. The averaged debiased lasso estimator (for lack of a better name) is
(3.1) $\bar\theta^d = \frac{1}{m}\sum_{k=1}^m \hat\theta^d_k,$
where $\hat\theta^d_k$ is the debiased lasso estimator (2.1) formed from the data on the $k$th machine.
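As a concrete sketch of (3.1), the following numpy simulation splits a synthetic dataset across machines, debiases each local lasso, and averages. For simplicity each machine here has more samples than dimensions, so we debias with the exact inverse of the local sample covariance; the paper's high-dimensional setting would instead form $M$ by (2.2). The names and sizes are our own illustration.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        theta = soft_threshold(theta - X.T @ (X @ theta - y) / (n * L), lam / L)
    return theta

def debiased_lasso(X, y, lam):
    """Local lasso plus the correction (2.1), with M = (X^T X / n)^{-1}."""
    n = X.shape[0]
    theta = lasso_ista(X, y, lam)
    M = np.linalg.inv(X.T @ X / n)  # valid here since n > p
    return theta + M @ X.T @ (y - X @ theta) / n

rng = np.random.default_rng(0)
m, n_local, p, s = 5, 200, 20, 3  # machines, samples/machine, dimension, sparsity
theta_star = np.zeros(p)
theta_star[:s] = 1.0
Xs = [rng.standard_normal((n_local, p)) for _ in range(m)]
ys = [X_k @ theta_star + 0.5 * rng.standard_normal(n_local) for X_k in Xs]

lam = 0.5 * np.sqrt(np.log(p) / n_local)
# Each machine debiases its local lasso; the master averages, as in (3.1).
theta_avg = np.mean([debiased_lasso(X_k, y_k, lam)
                     for X_k, y_k in zip(Xs, ys)], axis=0)
```

Only the local debiased estimates cross the network, so the communication cost is one $p$-vector per machine.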
We study the error of the averaged debiased lasso in the $\ell_\infty$ norm.
Lemma 3.1.
Suppose the local sparse regression problem on each machine satisfies the conditions of Corollary 2.8; that is,

satisfy the RE condition on with constant

have generalized coherence

we set
Then
with probability at least where is a universal constant, and
Lemma 3.1 hints at the performance of the averaged debiased lasso. In particular, we note the first term is $O\left(\sigma\sqrt{\frac{\log p}{N}}\right)$, which matches the convergence rate of the centralized estimator. When $n$ is large enough, the bias term is negligible compared to the first term, and the error is $O\left(\sigma\sqrt{\frac{\log p}{N}}\right)$.
Finally, we show the conditions of Lemma 3.1 occur with high probability when the rows of $X$ are independent subgaussian random vectors.
Theorem 3.2.
Under (A1), (A2), and (A3), when , ,

,

we set ,

we set and form by (2.2),
with probability at least for some universal constant
Proof.
We start with the conclusion of Lemma 3.1:
First, we show that the two constants and are bounded with high probability.
Lemma 3.3.
Under (A1),
for some universal constant
Lemma 3.4.
Under (A1),
for some universal constant
We put the pieces together to obtain the stated result:
We apply the bounds , , and to obtain
∎
We validate our theoretical results with simulations. First, we study the estimation error of the averaged debiased lasso. To focus on the effect of averaging, we grow the number of machines linearly with the (total) sample size $N$. In other words, we fix the sample size per machine and grow the total sample size by adding machines. Figure 1 compares the estimation error of the averaged debiased lasso estimator with that of the centralized lasso. We see the estimation error of the averaged debiased lasso estimator is comparable to that of the centralized lasso, while that of the naive averaged lasso is much worse.
We conduct a second set of simulations to study the effect of the number of machines on the estimation error of the averaged estimator. To focus on the effect of the number of machines, we fix the (total) sample size and vary the number of machines the samples are distributed across. Figure 2 shows how the estimation error of the averaged estimator grows as the number of machines grows. When the number of machines is small, the estimation error of the averaged estimator is comparable to that of the centralized lasso. However, when the number of machines exceeds a certain threshold, the estimation error grows with the number of machines. This is consistent with the prediction of Theorem 3.2: when the number of machines exceeds a certain threshold, the bias term of order $\frac{s \log p}{n}$ becomes dominant.
The averaged debiased lasso has one serious drawback versus the lasso: $\bar\theta^d$ is usually dense. The density of $\bar\theta^d$ detracts from the interpretability of the coefficients and makes the estimation error large in the $\ell_1$ and $\ell_2$ norms. To remedy both problems, we threshold the averaged debiased lasso:
$$H_t(\bar\theta^d)_j = \bar\theta^d_j\, 1\{|\bar\theta^d_j| > t\}, \qquad S_t(\bar\theta^d)_j = \operatorname{sign}(\bar\theta^d_j)\max\{|\bar\theta^d_j| - t,\ 0\}.$$
As we shall see, both hard- and soft-thresholding give sparse aggregates that are close to $\theta^*$ in norm.
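The two thresholding rules can be sketched in a few lines of numpy (the helper names and the toy vector are our own):

```python
import numpy as np

def hard_threshold(theta, t):
    """Keep entries with magnitude above t; zero out the rest."""
    return np.where(np.abs(theta) > t, theta, 0.0)

def soft_threshold(theta, t):
    """Shrink every entry toward zero by t, zeroing the small ones."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

# A toy dense aggregate: small spurious entries surround two true signals.
theta_bar = np.array([0.9, -0.02, 0.0, 1.4, 0.05])
sparse_hard = hard_threshold(theta_bar, 0.1)
sparse_soft = soft_threshold(theta_bar, 0.1)
```

Hard thresholding leaves the surviving entries unchanged, while soft thresholding also shrinks them by $t$; both zero out the small spurious coordinates.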
Lemma 3.5.
As long as satisfies
The analogous result also holds for
Proof.
By the triangle inequality,
Since whenever Thus is sparse and is sparse. By the equivalence between the and , norms,
The argument for is similar. ∎
By combining Lemma 3.5 with Theorem 3.2, we show that the thresholded estimator converges at the same rates as the centralized lasso.
Theorem 3.6.
Under the conditions of Theorem 3.2, hard-thresholding at a suitable level $t$ gives
Remark 3.7.
By Theorem 3.6, when the variance term is dominant, the convergence rates given by the theorem simplify:
The convergence rates for the centralized lasso estimator are identical (modulo constants):
The thresholded estimator matches the convergence rates of the centralized lasso in the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms. Furthermore, it can be evaluated in a communication-efficient manner by a one-shot averaging approach.
We conduct a third set of simulations to study the effect of thresholding on the estimation error. Figure 3 compares the estimation error incurred by the averaged estimator with and without thresholding against that of the centralized lasso. Since the averaged estimator is usually dense, its estimation error is large compared to that of the centralized lasso. However, after thresholding, the averaged estimator performs comparably with the centralized lasso.
4 A distributed approach to debiasing
The averaged estimator we studied has the form
The estimator requires each machine to form $M_k$ by solving (2.2). Since the dual of (2.2) is an $\ell_1$-regularized quadratic program:
(4.1)
forming $M_k$ is (roughly speaking) $p$ times as expensive as solving the local lasso problem, making it the most expensive step (in terms of FLOPs) of evaluating the averaged estimator. To trim the cost of the debiasing step, we consider an estimator that forms only a single $M$:
(4.2) 
To evaluate (4.2):

each machine sends its local lasso estimator and its local correction term to a central server;

the central server forms the averages of the machines' messages and sends the averages to all the machines;

each machine, given the averages, forms its assigned rows of $M$ and debiases the corresponding coefficients.

As we shall see, each machine can perform debiasing with only the data stored locally. Thus, forming the estimator (4.2) requires two rounds of communication.
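The two-round pattern can be sketched as follows. This is our own schematic: we assume, for illustration, that each machine's message consists of its local lasso estimate and its local correction vector, that the server simply averages both, and that a single matrix (here the inverse sample covariance of one machine's low-dimensional data) debiases the average; the actual quantities exchanged are those defined by (4.2).

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        theta = soft_threshold(theta - X.T @ (X @ theta - y) / (n * L), lam / L)
    return theta

rng = np.random.default_rng(1)
m, n_local, p, s = 4, 150, 10, 2
theta_star = np.zeros(p)
theta_star[:s] = 1.0
Xs = [rng.standard_normal((n_local, p)) for _ in range(m)]
ys = [X @ theta_star + 0.3 * rng.standard_normal(n_local) for X in Xs]
lam = 0.5 * np.sqrt(np.log(p) / n_local)

# Round 1: each machine sends its local lasso estimate and correction vector.
thetas = [lasso_ista(X, y, lam) for X, y in zip(Xs, ys)]
corrections = [X.T @ (y - X @ th) / n_local
               for X, y, th in zip(Xs, ys, thetas)]

# Round 2: the server averages the messages and broadcasts the averages.
theta_bar = np.mean(thetas, axis=0)
g_bar = np.mean(corrections, axis=0)

# A single M debiases the averaged estimate (a hypothetical simple choice).
M = np.linalg.inv(Xs[0].T @ Xs[0] / n_local)
theta_tilde = theta_bar + M @ g_bar
```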
The question that remains is how to form $M$. We consider an estimator proposed by van de Geer et al. (2013): nodewise regression on the predictors. For each coefficient $j$ that the machine is debiasing, the machine solves
$$\hat\gamma_j = \arg\min_{\gamma}\ \frac{1}{2n}\|X_{\cdot j} - X_{\cdot,-j}\gamma\|_2^2 + \lambda_j\|\gamma\|_1,$$
where $X_{\cdot,-j}$ is $X$ less its $j$th column $X_{\cdot j}$. Implicitly, we are forming
$$\hat\Gamma_j = (1,\ -\hat\gamma_{j,1},\ \dots,\ -\hat\gamma_{j,p-1}),$$
where the components of $\hat\gamma_j$ are indexed by $\{1,\dots,p\}\setminus\{j\}$. We scale the rows by $\hat\tau_j^{-2}$, where
$$\hat\tau_j^2 = \frac{1}{n}\|X_{\cdot j} - X_{\cdot,-j}\hat\gamma_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1,$$
to form $M$. Each row of $M$ is given by
(4.3) $m_j = \hat\tau_j^{-2}\,\hat\Gamma_j.$
Since $\hat\gamma_j$ and $\hat\tau_j$ only depend on the locally stored predictors, they can be formed without any communication.
Before we justify the choice of $M$ theoretically, we mention that it is an approximate "inverse" of $\hat\Sigma$ (in a componentwise sense). By the optimality conditions of nodewise regression,
$$\frac{1}{n} X_{\cdot,-j}^T \bigl(X_{\cdot j} - X_{\cdot,-j}\hat\gamma_j\bigr) = \lambda_j \hat v_j,$$
where $\hat v_j$ is a subgradient of the $\ell_1$ norm at $\hat\gamma_j$. Recalling the definition of $\hat\tau_j^2$, we have
$$\frac{1}{n} X_{\cdot j}^T \bigl(X_{\cdot j} - X_{\cdot,-j}\hat\gamma_j\bigr) = \hat\tau_j^2,$$
so the $j$th entry of $\hat\Sigma m_j^T$ is exactly one, and the remaining entries are bounded by $\lambda_j / \hat\tau_j^2$.
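The construction (4.3) and the approximate-inverse property can be checked numerically. Below is a numpy sketch under our own toy sizes, using an ISTA solver for the nodewise lasso; the function names are assumptions, not from the paper.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=1000):
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    theta = np.zeros(p)
    for _ in range(n_iter):
        theta = soft_threshold(theta - X.T @ (X @ theta - y) / (n * L), lam / L)
    return theta

def nodewise_row(X, j, lam):
    """Row j of M: lasso of column j on the rest, scaled as in (4.3)."""
    n, p = X.shape
    others = [k for k in range(p) if k != j]
    gamma = lasso_ista(X[:, others], X[:, j], lam)
    resid = X[:, j] - X[:, others] @ gamma
    tau2 = resid @ resid / n + lam * np.abs(gamma).sum()
    row = np.zeros(p)
    row[j] = 1.0
    row[others] = -gamma
    return row / tau2

rng = np.random.default_rng(0)
n, p = 300, 10
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n
M = np.vstack([nodewise_row(X, j, lam=0.05) for j in range(p)])

# M is an approximate componentwise inverse: the generalized coherence is small.
coherence = np.max(np.abs(M @ Sigma_hat - np.eye(p)))
```

By the optimality conditions above, the diagonal entries of $M\hat\Sigma$ equal one at the exact optimum, and the off-diagonal entries are bounded by $\lambda_j/\hat\tau_j^2$, which is what the `coherence` value reflects up to solver tolerance.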