# Communication-efficient sparse regression: a one-shot approach

We devise a one-shot approach to distributed sparse regression in the high-dimensional setting. The key idea is to average "debiased" or "desparsified" lasso estimators. We show the approach converges at the same rate as the lasso as long as the dataset is not split across too many machines. We also extend the approach to generalized linear models.


## 1 Introduction

Explosive growth in the size of modern datasets has fueled interest in distributed statistical learning. For examples, we refer to Boyd et al. (2011); Dekel et al. (2012); Duchi, Agarwal and Wainwright (2012); Zhang, Duchi and Wainwright (2013) and the references therein. The problem arises, for example, when working with datasets that are too large to fit on a single machine and must be distributed across multiple machines. The main bottleneck in the distributed setting is usually communication between machines/processors, so the overarching goal of algorithm design is to minimize communication costs.

In distributed statistical learning, the simplest and most popular approach is averaging: each machine forms a local estimator with the portion of the data stored locally, and a “master” averages the local estimators to produce an aggregate estimator. Averaging was first studied by McDonald et al. (2009) for multinomial regression. They derive non-asymptotic bounds on the estimation error that show averaging reduces the variance of the local estimators, but has no effect on the bias (from the centralized solution). In follow-up work, Zinkevich et al. (2010) studied a variant of averaging where each machine computes a local estimator with stochastic gradient descent (SGD) on a random subset of the dataset. They show, among other things, that their estimator converges to the centralized estimator.

More recently, Zhang, Duchi and Wainwright (2013) studied averaged empirical risk minimization (ERM). They show that the mean squared error (MSE) of the averaged ERM decays like $O(N^{-1} + m^2N^{-2})$, where $m$ is the number of machines and $N$ is the total number of samples. Thus, so long as $m \lesssim \sqrt{N}$, the averaged ERM matches the convergence rate of the centralized ERM. Even more recently, Rosenblatt and Nadler (2014) studied the optimality of averaged ERM in two asymptotic settings: $n \to \infty$, $m$ fixed and $n, m \to \infty$, where $n$ is the number of samples per machine. They show that in the $n \to \infty$, $m$ fixed setting, the averaged ERM is first-order equivalent to the centralized ERM. However, when $m$ grows with $n$, the averaged ERM is suboptimal (versus the centralized ERM).

We develop an approach to distributed statistical learning in the high-dimensional setting, where the number of parameters may exceed the sample size, so regularization is essential. At a high level, the key idea is to average local debiased regularized M-estimators. We show that our averaged estimator converges at the same rate as the centralized regularized M-estimator.

## 2 Background on the lasso and the debiased lasso

To keep things simple, we focus on sparse linear regression. Consider the sparse linear model

$$y = X\beta^* + \epsilon,$$

where the rows of $X \in \mathbb{R}^{n\times p}$ are the predictors and the components of $y \in \mathbb{R}^n$ are the responses. Throughout, we assume

1. (A1) the predictors (the rows of $X$) are independent subgaussian random vectors whose covariance $\Sigma$ has smallest eigenvalue $\lambda_{\min}(\Sigma) > 0$;

2. (A2) the regression coefficients $\beta^*$ are $s$-sparse, i.e. all but $s$ components of $\beta^*$ are zero;

3. (A3) the components of $\epsilon$ are independent, mean-zero subgaussian random variables.

Given the predictors and responses, the lasso estimates $\beta^*$ by

$$\hat\beta := \operatorname*{arg\,min}_{\beta\in\mathbb{R}^p}\ \frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1.$$

There is a well-developed theory of the lasso that says, under suitable assumptions on $X$, the lasso estimator is nearly as close to $\beta^*$ as the oracle estimator that knows the support of $\beta^*$ (e.g. see Hastie, Tibshirani and Wainwright (2015), Chapter 11 for an overview). More precisely, under some conditions on $X$, the MSE of the lasso estimator is roughly $\frac{s\log p}{n}$. Since the MSE of the oracle estimator is (roughly) $\frac{s}{n}$, the lasso estimator is almost as good as the oracle estimator.
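The lasso objective above can be minimized with a simple proximal-gradient (ISTA) iteration. The sketch below is a minimal NumPy illustration, not the solver used in the paper; the step size (the inverse Lipschitz constant of the smooth part) and iteration count are illustrative choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator, the prox of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize (1/(2n)) ||y - X beta||_2^2 + lam * ||beta||_1 by ISTA."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n  # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta
```

On a well-conditioned sparse problem, the iterate recovers the large coefficients while (as discussed next) shrinking them toward zero.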

However, the lasso estimator is also biased (we refer to Section 2.2 in Javanmard and Montanari (2013a) for a more formal discussion of the bias of the lasso estimator). Since averaging only reduces variance, not bias, we gain (almost) nothing by averaging the biased lasso estimators. That is, it is possible to show that if we naively averaged local lasso estimators, the MSE of the averaged estimator would be of the same order as that of the local estimators. The key to overcoming the bias of the averaged lasso estimator is to “debias” the local lasso estimators before averaging.

The debiased lasso estimator of Javanmard and Montanari (2013a) is

$$\hat\beta^d := \hat\beta + \frac{1}{n}\hat\Theta X^T(y - X\hat\beta), \qquad (2.1)$$

where $\hat\beta$ is the lasso estimator and $\hat\Theta$ is an approximate inverse of $\hat\Sigma := \frac{1}{n}X^TX$. Intuitively, the debiased lasso estimator trades bias for variance. The trade-off is obvious when $\hat\Sigma$ is non-singular: setting $\hat\Theta = \hat\Sigma^{-1}$ gives the ordinary least squares (OLS) estimator. Another way to interpret the debiased lasso estimator is as a corrected estimator that compensates for the bias incurred by shrinkage. By the optimality conditions of the lasso, the correction term $\frac{1}{n}X^T(y - X\hat\beta)$ is proportional to a subgradient of $\|\cdot\|_1$ at $\hat\beta$. By adding a term proportional to the subgradient of the regularizer, the debiased lasso estimator compensates for the bias incurred by regularization. The debiased lasso estimator has previously been used to perform inference on the regression coefficients in high-dimensional regression models. We refer to the papers by Javanmard and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014); Belloni, Chernozhukov and Hansen (2011) for details.
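As a quick sketch of (2.1): given any plug-in $\hat\Theta$, the correction is a single matrix-vector product. The helper below is illustrative; in particular, passing the exact inverse of $\hat\Sigma$ (when it exists) recovers OLS, as noted above.

```python
import numpy as np

def debias_lasso(X, y, beta_hat, Theta_hat):
    """Debiased estimator (2.1): add a correction proportional to the
    gradient of the least-squares loss at beta_hat."""
    n = X.shape[0]
    return beta_hat + Theta_hat @ X.T @ (y - X @ beta_hat) / n
```

Note the initial estimator `beta_hat` is arbitrary here; when `Theta_hat` is the exact inverse of the Gram matrix, the output is the OLS estimator regardless of `beta_hat`.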

The choice of $\hat\Theta$ in the correction term is crucial to the performance of the debiased estimator. Javanmard and Montanari (2013a) suggest forming $\hat\Theta$ row by row: the $j$-th row of $\hat\Theta$ is the optimum of

$$\operatorname*{minimize}_{\theta\in\mathbb{R}^p}\ \theta^T\hat\Sigma\theta \quad\text{subject to}\quad \|\hat\Sigma\theta - e_j\|_\infty \le \delta. \qquad (2.2)$$

The parameter $\delta$ should be large enough to keep the problem feasible, but as small as possible to keep the bias (of the debiased lasso estimator) small. As we shall see, when the rows of $X$ are subgaussian, setting $\delta$ of order $\left(\frac{\log p}{n}\right)^{1/2}$ is usually large enough to keep (2.2) feasible.

###### Definition 2.1 (Generalized coherence).

Given $\hat\Sigma, \Theta \in \mathbb{R}^{p\times p}$, let $\Theta_j$ denote the $j$-th row of $\Theta$. The generalized coherence between $\hat\Sigma$ and $\Theta$ is

$$GC(\hat\Sigma, \Theta) = \max_{j\in[p]}\|\hat\Sigma\Theta_j^T - e_j\|_\infty.$$
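The generalized coherence is straightforward to compute: since $\hat\Sigma$ is symmetric, stacking the columns $\hat\Sigma\Theta_j^T$ over $j$ shows it equals the largest entrywise deviation of $\Theta\hat\Sigma$ from the identity. A short NumPy sketch:

```python
import numpy as np

def generalized_coherence(Sigma_hat, Theta):
    """GC(Sigma_hat, Theta) = max_j || Sigma_hat Theta_j^T - e_j ||_inf,
    the worst entrywise deviation of Theta @ Sigma_hat from I
    (the two forms agree because Sigma_hat is symmetric)."""
    R = Theta @ Sigma_hat - np.eye(Sigma_hat.shape[0])
    return np.max(np.abs(R))
```

An exact inverse has coherence zero; a poor approximate inverse has coherence bounded away from zero.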
###### Lemma 2.2 (Javanmard and Montanari (2013a)).

Under (A1), when $n \gtrsim \log p$, the event

$$\mathcal{E}_{GC}(\hat\Sigma) := \left\{GC(\hat\Sigma, \Sigma^{-1}) \le 8\sqrt{c_1}\sqrt{\kappa}\,\sigma_x^2\left(\frac{\log p}{n}\right)^{1/2}\right\}$$

occurs with high probability, for some universal constant $c_1 > 0$, where $\kappa$ is the condition number of $\Sigma$.

As we shall see, the bias of the debiased lasso estimate is of higher order than its variance under suitable conditions on $\hat\Sigma$. In particular, we require $\hat\Sigma$ to satisfy the restricted eigenvalue (RE) condition.

###### Definition 2.3 (RE condition).

For any $S \subseteq [p]$, let $\mathcal{C}(S) := \{\Delta\in\mathbb{R}^p : \|\Delta_{S^c}\|_1 \le 3\|\Delta_S\|_1\}$. We say $\hat\Sigma$ satisfies the RE condition on the cone $\mathcal{C}(S)$ when

$$\Delta^T\hat\Sigma\Delta \ge \mu_l\|\Delta_S\|_2^2$$

for some $\mu_l > 0$ and any $\Delta \in \mathcal{C}(S)$.

The RE condition requires $\hat\Sigma$ to be positive definite on $\mathcal{C}(S)$. When the rows of $X$ are i.i.d. Gaussian random vectors, Raskutti, Wainwright and Yu (2010) show there are constants $\mu_1, \mu_2 > 0$ such that

$$\frac{1}{n}\|X\Delta\|_2^2 \ge \mu_1\|\Delta\|_2^2 - \mu_2\frac{\log p}{n}\|\Delta\|_1^2 \quad\text{for any } \Delta\in\mathbb{R}^p$$

with high probability. Their result implies the RE condition holds on $\mathcal{C}(S)$ (for any $S$ of size at most $s$) as long as $n \gtrsim s\log p$, even when there are dependencies among the predictors. Their result was extended to subgaussian designs by Rudelson and Zhou (2013), also allowing for dependencies among the covariates. We summarize their result in a lemma.

###### Lemma 2.4.

Under (A1), when $n > \max\left\{4000\tilde{s}\sigma_x^2\log(60\sqrt{2}ep/\tilde{s}),\ 8000\sigma_x^4\log p\right\}$, where $\tilde{s}$ is proportional to $s$, the event

$$\mathcal{E}_{RE}(X) = \left\{\Delta^T\hat\Sigma\Delta \ge \tfrac{1}{2}\lambda_{\min}(\Sigma)\|\Delta_S\|_2^2 \text{ for any } \Delta\in\mathcal{C}(S)\right\}$$

occurs with probability at least $1 - 2e^{-\frac{n}{4000\sigma_x^4}}$.

###### Proof.

The lemma is a consequence of Rudelson and Zhou (2013), Theorem 6; in their notation, we choose the parameters to match our assumptions and bound the constants in their result accordingly. ∎

When the RE condition holds, the lasso and debiased lasso estimators are consistent for a suitable choice of the regularization parameter $\lambda$. The parameter $\lambda$ should be large enough to dominate the “empirical process” part of the problem, $\frac{1}{n}\|X^T\epsilon\|_\infty$, but as small as possible to reduce the bias incurred by regularization. As we shall see, setting $\lambda$ of order $\sigma_y\left(\frac{\log p}{n}\right)^{1/2}$ is a good choice.

###### Lemma 2.5.

Under (A3),

$$\frac{1}{n}\|X^T\epsilon\|_\infty \le \max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2}\,\sigma_y\left(\frac{3\log p}{c_2 n}\right)^{1/2}$$

with probability at least $1 - 2p^{-2}$, for any (non-random) $X$.

When $\hat\Sigma$ satisfies the RE condition and $\lambda$ is large enough, the lasso and debiased lasso estimators are consistent.

###### Lemma 2.6 (Negahban et al. (2012)).

Under (A2) and (A3), suppose $\hat\Sigma$ satisfies the RE condition on $\mathcal{C}(S)$ with constant $\mu_l$ and $\lambda \ge \frac{2}{n}\|X^T\epsilon\|_\infty$. Then

$$\|\hat\beta - \beta^*\|_1 \le \frac{3}{\mu_l}s\lambda \quad\text{and}\quad \|\hat\beta - \beta^*\|_2 \le \frac{3}{\mu_l}\sqrt{s}\,\lambda.$$

When the lasso estimator is consistent, the debiased lasso estimator is also consistent. Further, it is possible to show that the bias of the debiased estimator is of higher order than its variance. Similar results by Javanmard and Montanari (2013a); van de Geer et al. (2013); Zhang and Zhang (2014); Belloni, Chernozhukov and Hansen (2011) are the key step in showing the asymptotic normality of (the components of) the debiased lasso estimator. The result we state is essentially Javanmard and Montanari (2013a), Theorem 2.3.

###### Lemma 2.7.

Under the conditions of Lemma 2.6, when $(\hat\Sigma, \hat\Theta)$ has generalized coherence $GC(\hat\Sigma, \hat\Theta) \le \delta$, the debiased lasso estimator has the form

$$\hat\beta^d = \beta^* + \frac{1}{n}\hat\Theta X^T\epsilon + \hat\Delta,$$

where $\|\hat\Delta\|_\infty \le \delta\|\hat\beta - \beta^*\|_1$.

Lemma 2.7, together with Lemmas 2.5 and 2.2, shows that the bias of the debiased lasso estimator is of higher order than its variance. In particular, setting $\lambda$ and $\delta$ according to Lemmas 2.5 and 2.2 gives a bias term $\hat\Delta$ that is $O\!\left(\frac{s\log p}{n}\right)$. By comparison, the variance term $\frac{1}{n}\hat\Theta X^T\epsilon$ is the maximum of $p$ subgaussian random variables with mean zero and variances of order $\frac{1}{n}$, which is $O\!\left(\left(\frac{\log p}{n}\right)^{1/2}\right)$. Thus the bias term is of higher order than the variance term as long as $s \lesssim \left(\frac{n}{\log p}\right)^{1/2}$.

###### Corollary 2.8.

Under (A2), (A3), and the conditions of Lemma 2.6, when $(\hat\Sigma, \hat\Theta)$ has generalized coherence at most $\delta'\left(\frac{\log p}{n}\right)^{1/2}$ and we set $\lambda = \max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2}\sigma_y\left(\frac{3\log p}{c_2 n}\right)^{1/2}$,

$$\|\hat\Delta\|_\infty \le \frac{3\sqrt{3}}{\sqrt{c_2}}\,\delta'\,\frac{\max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2}}{\mu_l}\,\sigma_y\,\frac{s\log p}{n}.$$

## 3 Averaging debiased lassos

Recall the problem setup: we are given $N$ samples distributed across $m$ machines:

$$X = \begin{bmatrix} X_1 \\ \vdots \\ X_m \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_m \end{bmatrix}.$$

The $k$-th machine has local predictors $X_k$ and responses $y_k$. To keep things simple, we assume the data is evenly distributed, i.e. each machine has $n = N/m$ samples. The averaged debiased lasso estimator (for lack of a better name) is

$$\bar\beta = \frac{1}{m}\sum_{k=1}^m \hat\beta_k^d = \frac{1}{m}\sum_{k=1}^m \hat\beta_k + \frac{1}{n}\hat\Theta_k X_k^T(y_k - X_k\hat\beta_k). \qquad (3.1)$$

We study the error of the averaged debiased lasso in the $\ell_\infty$ norm.
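The one-shot recipe (3.1) can be sketched end to end in a few lines. In the sketch below, the local lasso is a minimal ISTA solver and, as a simplification, each $\hat\Theta_k$ is a ridge-regularized inverse of the local Gram matrix rather than the solution of program (2.2); the `ridge` parameter and the iteration count are illustrative assumptions, not part of the paper's method.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso(X, y, lam, n_iter=300):
    """Minimal ISTA solver for the local lasso."""
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        beta = soft_threshold(beta - X.T @ (X @ beta - y) / (n * L), lam / L)
    return beta

def averaged_debiased_lasso(Xs, ys, lam, ridge=1e-2):
    """One-shot estimator (3.1): each machine debiases its local lasso,
    the master averages.  Theta_k is a ridge-regularized inverse of the
    local Gram matrix (a simplification of program (2.2))."""
    p = Xs[0].shape[1]
    total = np.zeros(p)
    for Xk, yk in zip(Xs, ys):
        n = Xk.shape[0]
        bk = lasso(Xk, yk, lam)
        Theta = np.linalg.inv(Xk.T @ Xk / n + ridge * np.eye(p))
        total += bk + Theta @ Xk.T @ (yk - Xk @ bk) / n
    return total / len(Xs)
```

Only the debiased vectors travel to the master, so the communication cost is one $p$-vector per machine.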

###### Lemma 3.1.

Suppose the local sparse regression problem on each machine satisfies the conditions of Corollary 2.8; that is, for each $k \in [m]$,

1. $\hat\Sigma_k$ satisfies the RE condition on $\mathcal{C}(S)$ with constant $\mu_l$,

2. $(\hat\Sigma_k, \hat\Theta_k)$ has generalized coherence at most $\delta'\left(\frac{\log p}{n}\right)^{1/2}$,

3. we set $\lambda_k = \max_{j\in[p]}(\hat\Sigma_{k,j,j})^{1/2}\sigma_y\left(\frac{3\log p}{c_2 n}\right)^{1/2}$.

Then

$$\|\bar\beta - \beta^*\|_\infty \le c\left(\sigma_y\left(\frac{c_\Omega\log p}{N}\right)^{1/2} + \frac{c_{GC}\,c_\Sigma}{\mu_l}\,\sigma_y\,\frac{s\log p}{n}\right)$$

with high probability, where $c$ is a universal constant, $c_\Omega := \max_{k\in[m]}\max_{j\in[p]}(\hat\Theta_k\hat\Sigma_k\hat\Theta_k^T)_{j,j}$, $c_{GC} := \delta'$, and $c_\Sigma := \max_{k\in[m]}\max_{j\in[p]}(\hat\Sigma_{k,j,j})^{1/2}$.

Lemma 3.1 hints at the performance of the averaged debiased lasso. In particular, we note the first term is $O\!\left(\left(\frac{\log p}{N}\right)^{1/2}\right)$, which matches the convergence rate of the centralized estimator. When $n$ is large enough, the second (bias) term is negligible compared to the first, and the error is $O\!\left(\left(\frac{\log p}{N}\right)^{1/2}\right)$.

Finally, we show the conditions of Lemma 3.1 hold with high probability when the rows of $X$ are independent subgaussian random vectors.

###### Theorem 3.2.

Under (A1), (A2), and (A3), when $n \gtrsim s\log p$ and

1. the samples are evenly split across the $m$ machines,

2. we set $\lambda_k = \max_{j\in[p]}(\hat\Sigma_{k,j,j})^{1/2}\sigma_y\left(\frac{3\log p}{c_2 n}\right)^{1/2}$,

3. we set $\delta = \delta'\left(\frac{\log p}{n}\right)^{1/2}$, with $\delta'$ of order $\sqrt{\kappa}\sigma_x^2$, and form $\hat\Theta_k$ by (2.2),

then

$$\|\bar\beta - \beta^*\|_\infty \le c\left(\sigma_y\left(\frac{\max_{j\in[p]}\Sigma^{-1}_{j,j}\log p}{N}\right)^{1/2} + \frac{\sqrt{\kappa}\max_{j\in[p]}(\Sigma_{j,j})^{1/2}}{\lambda_{\min}(\Sigma)}\,\sigma_x^2\sigma_y\,\frac{s\log p}{n}\right)$$

with high probability, for some universal constant $c$.

###### Proof.

By Lemma 3.1 and the choices of $\lambda_k$ and $\delta$,

$$\|\bar\beta - \beta^*\|_\infty \le \sigma_y\left(\frac{2c_\Omega\log p}{c_2 N}\right)^{1/2} + \frac{3\sqrt{3}}{\sqrt{c_2}}\,\frac{c_{GC}\,c_\Sigma}{\mu_l}\,\sigma_y\,\frac{s\log p}{n}.$$

First, we show that the two constants $c_\Omega$ and $c_\Sigma$ are bounded with high probability.

###### Lemma 3.3.

Under (A1),

$$\Pr\left(\max_{j\in[p]}(\Sigma^{-1}\hat\Sigma\Sigma^{-1})_{j,j} > 2\max_{j\in[p]}\Sigma^{-1}_{j,j}\right) \le 2p\,e^{-c_1\min\left\{\frac{n}{\sigma_x^2},\frac{n}{\sigma_x}\right\}}$$

for some universal constant $c_1 > 0$.

Since we form $\hat\Theta_k$ by (2.2),

$$(\hat\Theta_k\hat\Sigma_k\hat\Theta_k^T)_{j,j} \le \max_{j\in[p]}(\Sigma^{-1}\hat\Sigma_k\Sigma^{-1})_{j,j}.$$

Lemma 3.3 implies

$$\max_{j\in[p]}(\Sigma^{-1}\hat\Sigma_k\Sigma^{-1})_{j,j} \le 2\max_{j\in[p]}\Sigma^{-1}_{j,j} \quad\text{for each } k\in[m]$$

with probability at least $1 - 2mp\,e^{-c_1\min\{n/\sigma_x^2,\,n/\sigma_x\}}$.

###### Lemma 3.4.

Under (A1),

$$\Pr\left(\max_{j\in[p]}(\hat\Sigma_{j,j})^{1/2} > \sqrt{2}\max_{j\in[p]}(\Sigma_{j,j})^{1/2}\right) \le 2p\,e^{-c_1\min\left\{\frac{n}{16\sigma_x^2},\frac{n}{4\sigma_x}\right\}}$$

for some universal constant $c_1 > 0$.

We put the pieces together to obtain the stated result:

1. By Lemma 3.3 (and a union bound over $k\in[m]$),

$$\Pr\left(c_\Omega \ge 2\max_{j}\Sigma^{-1}_{j,j}\right) \le 2mp\,e^{-c_1\min\left\{\frac{n}{\sigma_x^2},\frac{n}{\sigma_x}\right\}}.$$

Since the right side is at most $2p^{-1}$ when $n \gtrsim \log p$,

$$\Pr\left(c_\Omega < 2\max_{j}\Sigma^{-1}_{j,j}\right) \ge 1 - 2p^{-1}.$$

2. By Lemma 3.4 (and a union bound over $k\in[m]$),

$$\Pr\left(c_\Sigma < \sqrt{2}\max_{j\in[p]}(\Sigma_{j,j})^{1/2}\right) \ge 1 - 2mp\,e^{-c_1\min\left\{\frac{n}{16\sigma_x^2},\frac{n}{4\sigma_x}\right\}}.$$

When $n \gtrsim \log p$, the failure probability is again at most $2p^{-1}$.

3. By Lemma 2.4, as long as

$$n > \max\left\{4000\tilde{s}\sigma_x^2\log(60\sqrt{2}ep/\tilde{s}),\ 8000\sigma_x^4\log p\right\},$$

all $\hat\Sigma_k$ satisfy the RE condition with probability at least

$$1 - 2m\,e^{-\frac{n}{4000\sigma_x^4}} \ge 1 - 2p^{-1}.$$

4. By Lemma 2.2,

$$\Pr\left(\cap_{k\in[m]}\mathcal{E}_{GC}(\hat\Sigma_k)\right) \ge 1 - 2p^{-2},$$

which is, in particular, at least $1 - 2p^{-1}$.

We apply the bounds on $c_\Omega$, $c_\Sigma$, and $c_{GC}$ to obtain

$$\|\bar\beta - \beta^*\|_\infty \le \sigma_y\left(\frac{4\max_{j\in[p]}\Sigma^{-1}_{j,j}\log p}{c_2 N}\right)^{1/2} + \frac{48\sqrt{6}}{\sqrt{c_1 c_2}}\,\frac{\sqrt{\kappa}\max_{j\in[p]}(\Sigma_{j,j})^{1/2}}{\lambda_{\min}(\Sigma)}\,\sigma_x^2\sigma_y\,\frac{s\log p}{n}. \qquad\blacksquare$$

We validate our theoretical results with simulations. First, we study the estimation error of the averaged debiased lasso in the $\ell_\infty$ norm. To focus on the effect of averaging, we grow the number of machines linearly with the total sample size $N$. In other words, we fix the sample size per machine and grow the total sample size by adding machines. Figure 1 compares the estimation error (in $\ell_\infty$ norm) of the averaged debiased lasso estimator with that of the centralized lasso. We see the estimation error of the averaged debiased lasso estimator is comparable to that of the centralized lasso, while that of the naive averaged lasso is much worse.

We conduct a second set of simulations to study the effect of the number of machines on the estimation error of the averaged estimator. To focus on the effect of the number of machines, we fix the total sample size and vary the number of machines the samples are distributed across. Figure 2 shows how the estimation error (in $\ell_\infty$ norm) of the averaged estimator grows as the number of machines grows. When the number of machines is small, the estimation error of the averaged estimator is comparable to that of the centralized lasso. However, when the number of machines exceeds a certain threshold, the estimation error grows with the number of machines. This is consistent with the prediction of Theorem 3.2: when the number of machines exceeds a certain threshold, the bias term of order $\frac{s\log p}{n}$ becomes dominant.

The averaged debiased lasso has one serious drawback versus the lasso: $\bar\beta$ is usually dense. The density of $\bar\beta$ detracts from the interpretability of the coefficients and makes the estimation error large in the $\ell_1$ and $\ell_2$ norms. To remedy both problems, we threshold the averaged debiased lasso:

$$HT_t(\bar\beta)_j = \bar\beta_j\cdot 1\{|\bar\beta_j| \ge t\}, \qquad ST_t(\bar\beta)_j = \operatorname{sign}(\bar\beta_j)\cdot\max\{|\bar\beta_j| - t,\, 0\}.$$

As we shall see, both hard and soft thresholding give sparse aggregates that are close to $\beta^*$.
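The two thresholding rules are a direct transcription of the definitions above (illustrative NumPy, with `t` the threshold level):

```python
import numpy as np

def hard_threshold(beta, t):
    """HT_t: zero out entries with |beta_j| < t, keep the rest unchanged."""
    return beta * (np.abs(beta) >= t)

def soft_threshold_vec(beta, t):
    """ST_t: shrink every entry toward zero by t, zeroing small entries."""
    return np.sign(beta) * np.maximum(np.abs(beta) - t, 0.0)
```

Hard thresholding preserves the magnitudes of the surviving coefficients, while soft thresholding shrinks them by `t`; both produce sparse vectors.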

###### Lemma 3.5.

As long as $t$ satisfies $t \ge \|\bar\beta - \beta^*\|_\infty$, the hard-thresholded estimator $\bar\beta^{ht} := HT_t(\bar\beta)$ satisfies

$$\|\bar\beta^{ht} - \beta^*\|_\infty \le 2t, \qquad \|\bar\beta^{ht} - \beta^*\|_2 \le 2\sqrt{2s}\,t, \qquad \|\bar\beta^{ht} - \beta^*\|_1 \le 4st.$$

The analogous result also holds for the soft-thresholded estimator $\bar\beta^{st} := ST_t(\bar\beta)$.

###### Proof.

By the triangle inequality,

$$\|\bar\beta^{ht} - \beta^*\|_\infty \le \|\bar\beta^{ht} - \bar\beta\|_\infty + \|\bar\beta - \beta^*\|_\infty \le t + \|\bar\beta - \beta^*\|_\infty \le 2t.$$

Since $\bar\beta^{ht}_j = 0$ whenever $|\bar\beta_j| < t$, and $|\bar\beta_j| \le t$ whenever $\beta^*_j = 0$ (because $t \ge \|\bar\beta - \beta^*\|_\infty$), $\bar\beta^{ht}$ is $s$-sparse and $\bar\beta^{ht} - \beta^*$ is $2s$-sparse. By the equivalence between the $\ell_\infty$, $\ell_2$, and $\ell_1$ norms on sparse vectors,

$$\|\bar\beta^{ht} - \beta^*\|_2 \le 2\sqrt{2s}\,t, \qquad \|\bar\beta^{ht} - \beta^*\|_1 \le 4st.$$

The argument for $\bar\beta^{st}$ is similar. ∎

By combining Lemma 3.5 with Theorem 3.2, we show that $\bar\beta^{ht}$ converges at the same rates as the centralized lasso.

###### Theorem 3.6.

Under the conditions of Theorem 3.2, hard-thresholding $\bar\beta$ at a level $t$ equal to the right side of the bound in Theorem 3.2 gives

$$\|\bar\beta^{ht} - \beta^*\|_\infty \le 2t, \qquad \|\bar\beta^{ht} - \beta^*\|_2 \le 2\sqrt{2s}\,t, \qquad \|\bar\beta^{ht} - \beta^*\|_1 \le 4st$$

with high probability.

###### Remark 3.7.

By Theorem 3.6, when $\frac{s\log p}{n} \lesssim \left(\frac{\log p}{N}\right)^{1/2}$, the variance term is dominant and the convergence rates given by the theorem simplify to

$$\|\bar\beta^{ht} - \beta^*\|_2 \lesssim \sigma_y\left(\frac{s\log p}{N}\right)^{1/2}, \qquad \|\bar\beta^{ht} - \beta^*\|_1 \lesssim \sigma_y\,s\left(\frac{\log p}{N}\right)^{1/2}.$$

The convergence rates for the centralized lasso estimator are identical (modulo constants). The estimator $\bar\beta^{ht}$ thus matches the convergence rates of the centralized lasso in the $\ell_\infty$, $\ell_2$, and $\ell_1$ norms. Furthermore, $\bar\beta^{ht}$ can be evaluated in a communication-efficient manner by a one-shot averaging approach.

We conduct a third set of simulations to study the effect of thresholding on the estimation error in the $\ell_2$ norm. Figure 3 compares the estimation error incurred by the averaged estimator with and without thresholding versus that of the centralized lasso. Since the averaged estimator is usually dense, its estimation error (in $\ell_2$ norm) is large compared to that of the centralized lasso. However, after thresholding, the averaged estimator performs comparably to the centralized lasso.

## 4 A distributed approach to debiasing

The averaged estimator we studied has the form

$$\bar\beta = \frac{1}{m}\sum_{k=1}^m \hat\beta_k + \frac{1}{n}\hat\Theta_k X_k^T(y_k - X_k\hat\beta_k).$$

The estimator requires each machine to form $\hat\Theta_k$ by solving (2.2) for each of the $p$ rows. Since the dual of (2.2) is an $\ell_1$-regularized quadratic program,

$$\operatorname*{minimize}_{\gamma\in\mathbb{R}^p}\ \frac{1}{2}\gamma^T\hat\Sigma_k\gamma - e_j^T\gamma + \delta\|\gamma\|_1, \qquad (4.1)$$

forming $\hat\Theta_k$ is (roughly speaking) $p$ times as expensive as solving the local lasso problem, making it the most expensive step (in terms of FLOPS) of evaluating the averaged estimator. To trim the cost of the debiasing step, we consider an estimator that forms only a single $\hat\Theta$:

$$\tilde\beta = \frac{1}{m}\sum_{k=1}^m \hat\beta_k + \frac{1}{N}\hat\Theta\sum_{k=1}^m X_k^T(y_k - X_k\hat\beta_k). \qquad (4.2)$$

To evaluate (4.2),

1. each machine sends $\hat\beta_k$ and $X_k^T(y_k - X_k\hat\beta_k)$ to a central server,

2. the central server forms the averages $\frac{1}{m}\sum_{k=1}^m\hat\beta_k$ and $\frac{1}{N}\sum_{k=1}^m X_k^T(y_k - X_k\hat\beta_k)$ and sends them to all the machines,

3. each machine, given the averages, forms a subset of the rows of $\hat\Theta$ and debiases the corresponding coefficients:

$$\tilde\beta_j = \frac{1}{m}\sum_{k=1}^m\hat\beta_{k,j} + \hat\Theta_{j,\cdot}\left(\frac{1}{N}\sum_{k=1}^m X_k^T(y_k - X_k\hat\beta_k)\right),$$

where $\hat\Theta_{j,\cdot}$ is a row vector.

As we shall see, each machine can perform debiasing with only the data stored locally. Thus, forming the estimator (4.2) requires two rounds of communication.
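The two-round message flow for (4.2) can be sketched as follows. The sketch is a single-process simulation of the protocol: `local_lasso` and `theta_row` are stand-ins for whatever local solver and row-of-$\hat\Theta$ construction each machine uses (they are parameters here, not part of the paper's specification).

```python
import numpy as np

def two_round_estimator(Xs, ys, local_lasso, theta_row):
    """Simulate the two-round protocol for (4.2).
    local_lasso(Xk, yk) -> local estimate beta_k;
    theta_row(j) -> j-th row of the single Theta_hat."""
    N = sum(Xk.shape[0] for Xk in Xs)
    p = Xs[0].shape[1]
    # Round 1: each machine sends beta_k and X_k^T (y_k - X_k beta_k).
    betas, grads = [], []
    for Xk, yk in zip(Xs, ys):
        bk = local_lasso(Xk, yk)
        betas.append(bk)
        grads.append(Xk.T @ (yk - Xk @ bk))
    # Server: average and broadcast.
    beta_bar = np.mean(betas, axis=0)
    g_bar = np.sum(grads, axis=0) / N
    # Round 2: each coordinate j is debiased by the machine that owns row j.
    return np.array([beta_bar[j] + theta_row(j) @ g_bar for j in range(p)])
```

One sanity check: with trivial local estimates (all zeros) and $\hat\Theta$ equal to the exact inverse of the global Gram matrix, the protocol reproduces the global OLS estimator.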

The question that remains is how to form $\hat\Theta$. We consider an estimator proposed by van de Geer et al. (2013): nodewise regression on the predictors. For each coefficient $j$ that machine $k$ is debiasing, the machine solves

$$\hat\gamma_j := \operatorname*{arg\,min}_{\gamma\in\mathbb{R}^{p-1}}\ \frac{1}{2n}\|X_{k,j} - X_{k,-j}\gamma\|_2^2 + \lambda_j\|\gamma\|_1, \qquad j\in[p],$$

where $X_{k,-j}$ is $X_k$ less its $j$-th column $X_{k,j}$. Implicitly, we are forming

$$\hat{C} := \begin{bmatrix} 1 & -\hat\gamma_{1,2} & \ldots & -\hat\gamma_{1,p} \\ -\hat\gamma_{2,1} & 1 & \ldots & -\hat\gamma_{2,p} \\ \vdots & \vdots & \ddots & \vdots \\ -\hat\gamma_{p,1} & -\hat\gamma_{p,2} & \ldots & 1 \end{bmatrix},$$

where the components of $\hat\gamma_j$ are indexed by $[p]\setminus\{j\}$. We scale the rows of $\hat{C}$ by $\hat\tau_j^{-2}$, where

$$\hat\tau_j = \left(\frac{1}{n}\|X_j - X_{-j}\hat\gamma_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1\right)^{1/2},$$

to form $\hat\Theta$. Each row of $\hat\Theta$ is given by

$$\hat\Theta_{j,\cdot} = \frac{1}{\hat\tau_j^2}\begin{bmatrix} -\hat\gamma_{j,1} & \ldots & -\hat\gamma_{j,j-1} & 1 & -\hat\gamma_{j,j+1} & \ldots & -\hat\gamma_{j,p} \end{bmatrix}. \qquad (4.3)$$

Since $\hat\gamma_j$ and $\hat\tau_j$ only depend on $X_k$, they can be formed without any communication.
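A single row of $\hat\Theta$ via nodewise regression can be sketched as below. The nodewise lasso is solved with a minimal ISTA loop (an illustrative choice; any lasso solver works), and the row is assembled and rescaled as in (4.3).

```python
import numpy as np

def nodewise_theta_row(X, j, lam, n_iter=500):
    """Row j of Theta_hat via nodewise regression (van de Geer et al.):
    lasso-regress X_j on the remaining columns, then rescale by 1/tau_j^2."""
    n, p = X.shape
    Xj = X[:, j]
    X_minus = np.delete(X, j, axis=1)
    # ISTA for the nodewise lasso
    L = np.linalg.norm(X_minus, 2) ** 2 / n
    gamma = np.zeros(p - 1)
    for _ in range(n_iter):
        grad = X_minus.T @ (X_minus @ gamma - Xj) / n
        z = gamma - grad / L
        gamma = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    resid = Xj - X_minus @ gamma
    tau_sq = resid @ resid / n + lam * np.abs(gamma).sum()  # tau_j^2
    row = np.insert(-gamma, j, 1.0)  # (-gamma_1, ..., 1, ..., -gamma_p)
    return row / tau_sq
```

At the optimum, the optimality conditions discussed next imply $\frac{1}{n}\hat\Theta_{j,\cdot}X^TX_j = 1$, which gives a convenient numerical check.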

Before we justify the choice of $\hat\Theta$ theoretically, we mention that it is an approximate “inverse” of $\hat\Sigma$ (in a component-wise sense). By the optimality conditions of nodewise regression,

$$\hat\tau_j^2 = \frac{1}{n}\|X_j - X_{-j}\hat\gamma_j\|_2^2 + \lambda_j\|\hat\gamma_j\|_1 = \frac{1}{n}\|X_j - X_{-j}\hat\gamma_j\|_2^2 + \frac{1}{n}(X_j - X_{-j}\hat\gamma_j)^T X_{-j}\hat\gamma_j = \frac{1}{n}X_j^T(X_j - X_{-j}\hat\gamma_j).$$

Recalling the definition of $\hat\Theta_{j,\cdot}$, we have

$$\frac{1}{n}\hat\Theta_{j,\cdot}X^T X_j = \frac{1}{\hat\tau_j^2}\cdot\frac{1}{n}(X_j - X_{-j}\hat\gamma_j)^T X_j = 1.$$