Distributed Multitask Learning

10/02/2015 · by Jialei Wang et al. · The University of Chicago; The University of Chicago Booth School of Business

We consider the problem of distributed multi-task learning, where each machine learns a separate, but related, task. Specifically, each machine learns a linear predictor in high-dimensional space,where all tasks share the same small support. We present a communication-efficient estimator based on the debiased lasso and show that it is comparable with the optimal centralized method.




1 Introduction

Learning multiple tasks simultaneously allows transferring information between related tasks and can improve performance compared to learning each task separately [Caruana, 1997]. It has been successfully exploited in, e.g., spam filtering [Weinberger et al., 2009], web search [Chapelle et al., 2010], disease prediction [Zhou et al., 2013] and eQTL mapping [Kim and Xing, 2010].

Tasks can be related to each other in a number of ways. In this paper, we focus on the high-dimensional multi-task setting with joint support, where a few variables are relevant to all tasks while the others are not predictive [Turlach et al., 2005; Obozinski et al., 2011; Lounici et al., 2011]. The standard approach is to use a mixed-norm penalty, as such penalties encourage selection of variables that affect all tasks. Using a mixed-norm penalty leads to better performance in terms of prediction, estimation and model selection compared to using the ℓ1 norm penalty, which is equivalent to considering each task separately.

Shared support multi-task learning is generally considered in a centralized setting where data from all tasks is available on a single machine, and the estimator is computed using a standard single-thread algorithm. With the growth of modern massive data sets, there is a need to revisit multi-task learning in a distributed setting, where tasks and data are distributed across machines and communication is expensive. In particular, we consider a setting where each machine holds one “task” and its related data.

We develop an efficient distributed algorithm for multi-task learning that exploits shared sparsity between tasks. Our algorithm (DSML) requires only one round of communication between the workers and the central node: each machine sends a vector to the central node and receives back a support set. Despite the limited communication, in reasonable regimes and under mild conditions our algorithm enjoys the same theoretical guarantees, in terms of the leading term, as the centralized approach. Table 1 summarizes our support recovery guarantees compared to the centralized (group lasso) and local (lasso) approaches, while Table 2 compares the parameter estimation and prediction error guarantees.

Approach      Communication   Assumptions                                    Min signal strength   Strength type
Lasso         0               Mutual Incoherence; Sparse Eigenvalue
Group lasso   all data        Mutual Incoherence; Sparse Eigenvalue
DSML          one round       Generalized Coherence; Restricted Eigenvalue

Table 1: Lower bound on coefficients required to ensure support recovery with p variables, m tasks, n samples per task, and a true support of size s.
Approach      Assumptions                                    Estimation error   Prediction error
Lasso         Restricted Eigenvalue
Group lasso   Restricted Eigenvalue
DSML          Generalized Coherence; Restricted Eigenvalue

Table 2: Comparison of parameter estimation errors and prediction errors. The DSML guarantees improve over the lasso and have the same leading term as the group lasso as long as the number of tasks is not too large.

2 Distributed Learning and Optimization

With the increase in the volume of data used for machine learning, and the availability of distributed computing resources, distributed learning and the use of distributed optimization for machine learning have received much attention.

Most work on distributed optimization focuses on “consensus problems”, where each machine holds a different objective and the goal is to communicate between the machines so as to jointly optimize the average objective, that is, to find a single vector that is good for all local objectives [Boyd et al., 2011]. The difficulty of consensus problems is that the local objectives might be rather different, and, as a result, one can obtain lower bounds on the amount of communication that must be exchanged in order to reach a joint optimum. In particular, the problem becomes harder as more machines are involved.

The consensus problem has also been studied in the stochastic setting [Ram et al., 2010], in which each machine receives stochastic estimates of its local objective. Thinking of each local objective as a generalization error with respect to a local distribution, we obtain the following distributed learning formulation [Balcan et al., 2012]: each machine holds a different source distribution from which it can sample, and this distribution corresponds to a different local generalization error. The goal is to find a single predictor that minimizes the average generalization error, based on samples drawn at the local nodes. Again, the problem becomes harder when more machines are involved, and one can obtain lower bounds on the amount of communication required; [Balcan et al., 2012] carry out such an analysis for several hypothesis classes.

A more typical situation in machine learning is one in which there is only a single source distribution, and data from this single source is distributed randomly across the machines (or equivalently, each machine has access to the same source distribution). Such a problem can be reduced to a consensus problem by performing consensus optimization of the empirical errors at each machine. However, such an approach ignores several issues. First, the local empirical objectives are not arbitrarily different, but rather quite similar, which can and should be taken advantage of in optimization [Shamir et al., 2014]. Second, since each machine has access to the source distribution, there is no lower bound on communication: an entirely “local” approach is possible, where each machine completely ignores the other machines and just uses its own data. In fact, increasing the number of machines only makes the problem easier (in that it can reduce the runtime or the number of samples per machine required to achieve target performance), as additional machines can always be ignored. In such a setting, the other relevant baseline is the “centralized” approach, where all data is communicated to a central machine which computes a predictor centrally. The goal here is then to obtain performance close to that of the “centralized” approach (and much better than the “local” approach), using roughly the same number of samples, but with low communication and computation costs. Such single-source distributed problems have been studied both in terms of predictive performance [Shamir and Srebro, 2014; Jaggi et al., 2014] and parameter estimation [Zhang et al., 2013b, a; Lee et al., 2015].

In this paper we suggest a novel setting that combines aspects of the above two extremes. On one hand, we assume that each machine has a different source distribution, corresponding to a different task, as in consensus problems and in [Balcan et al., 2012]. For example, each machine serves a different geographical location, or each is at a different hospital or school with different characteristics. But if indeed there are differences between the source distributions, it is natural to learn a different predictor for each machine, one that is good for the distribution typical to that machine. In this regard, our distributed multi-task learning problem is more similar to single-source problems, in that machines could potentially learn on their own given enough samples and enough time. Furthermore, the availability of other machines just makes the problem easier by allowing transfer between the machines, thus reducing the sample complexity and runtime. The goal, then, is to leverage as much transfer as possible, while limiting communication and runtime. As with single-source problems, we compare our method to the two baselines: we would like to be much better than the “local” approach, achieving performance nearly as good as the “centralized” approach, but with minimal communication and efficient runtime.

To the best of our knowledge, the only previous discussion of distributed multi-task learning is [Dinuzzo et al., 2011], which considered a different setting with an almost orthogonal goal: a client-server architecture, where the server collects data from different clients and sends back information that might be helpful for each client to solve its own task. Their emphasis is on preserving privacy, but their architecture is communication-heavy, as the entire data set is communicated to the central server, as in the “centralized” baseline. We, on the other hand, are mostly concerned with communication costs but, for the time being, do not address privacy concerns.

3 Preliminaries

We consider the following multi-task linear regression model with m tasks:

    y_t = X_t β*_t + ε_t,    t = 1, …, m,

where y_t ∈ R^n, X_t ∈ R^{n×p}, ε_t ∈ R^n is a noise vector, and β*_t ∈ R^p is the unknown vector of coefficients for the t-th task. For notational simplicity we assume each task has an equal sample size and the same noise level. We will be working in a high-dimensional regime, with p possibly larger than n; however, we will assume that each β*_t is sparse, that is, that few components of β*_t are different from zero. Furthermore, we assume that the support is shared between the tasks: S = supp(β*_t) is the same for every t, with |S| = s. Supposing the data sets are distributed across m machines, our goal is to estimate the coefficient vectors as accurately as possible, while maintaining low communication cost.
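As a concrete illustration, the shared-support data-generating model above can be simulated as follows. This is a sketch with illustrative function names and parameter values; m tasks, n samples per task, p features, support size s, and noise level sigma correspond to the quantities just described.

```python
import numpy as np

def make_shared_support_data(m=4, n=100, p=50, s=5, sigma=1.0, seed=0):
    """Simulate m regression tasks y_t = X_t beta_t + eps_t whose
    coefficient vectors share a common support of size s."""
    rng = np.random.default_rng(seed)
    support = np.sort(rng.choice(p, size=s, replace=False))
    tasks = []
    for _ in range(m):
        beta = np.zeros(p)
        beta[support] = rng.uniform(0.5, 1.5, size=s)  # task-specific nonzeros
        X = rng.standard_normal((n, p))                # random design
        y = X @ beta + sigma * rng.standard_normal(n)  # noisy responses
        tasks.append((X, y, beta))
    return tasks, support
```

Each task gets its own coefficient values, but the set of nonzero coordinates is identical across tasks, which is exactly the joint-support assumption.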

The lasso estimate for each task is given by:

    β̂_t = argmin_β (1/2n) ||y_t − X_t β||_2^2 + λ ||β||_1.    (2)

The multi-task estimates are given by the joint optimization:

    (β̂_1, …, β̂_m) = argmin Σ_t (1/2n) ||y_t − X_t β_t||_2^2 + λ pen(B),    (3)

where pen(B) is a regularizer that promotes group-sparse solutions of the coefficient matrix B = (β_1, …, β_m). For example, the group lasso penalty sums the ℓ2 norms of the rows of B [Yuan and Lin, 2006], while the iCAP sums the ℓ∞ norms of the rows [Zhao et al., 2009]. In a distributed setting, one could potentially minimize (3) using a distributed consensus procedure (see Section 2), but such an approach would generally require multiple rounds of communication. Our procedure, described in the next section, lies in between the local lasso (2) and the centralized estimate (3): it requires only one round of communication, while still securing much of the statistical benefit of group regularization.
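For intuition, the local estimator (2) and a joint mixed-norm estimator of the type in (3) can be contrasted with scikit-learn, whose MultiTaskLasso implements an ℓ1/ℓ2 row-wise penalty related to the group lasso. This is a sketch with illustrative parameter values; for simplicity all tasks here share one design matrix, which MultiTaskLasso requires.

```python
import numpy as np
from sklearn.linear_model import Lasso, MultiTaskLasso

rng = np.random.default_rng(0)
n, p, m, s = 80, 30, 3, 4
support = rng.choice(p, size=s, replace=False)
B = np.zeros((p, m))
B[support] = rng.uniform(1.0, 2.0, size=(s, m))      # shared support
X = rng.standard_normal((n, p))
Y = X @ B + 0.1 * rng.standard_normal((n, m))

# Local approach: one lasso per task; supports may disagree across tasks.
local = np.column_stack(
    [Lasso(alpha=0.1).fit(X, Y[:, t]).coef_ for t in range(m)])

# Joint approach: the mixed-norm penalty selects whole rows of B at once.
joint = MultiTaskLasso(alpha=0.1).fit(X, Y).coef_.T  # shape (p, m)
selected = np.flatnonzero(np.linalg.norm(joint, axis=1) > 1e-8)
```

The joint estimator zeroes out entire rows of the coefficient matrix, so a variable is either selected for all tasks or for none, which is the behavior the mixed-norm penalty is designed to produce.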

4 Methodology

In this section, we detail our procedure for performing estimation under the model in (16). Algorithm 1 provides an outline of the steps executed by the worker nodes and the master node, which are explained in detail below.

for t = 1, …, m do (in parallel at the workers)
       Worker t obtains the local lasso solution in (2);
       Worker t computes the debiased lasso estimate in (17) and sends it to the master;
       if the estimated support is received from the master then
             Calculate the final estimate in (6).
       end if
end for
if debiased estimates are received from all workers then
       Compute the support estimate by group hard thresholding in (5) and send the result back to every worker.
end if
Algorithm 1 DSML: Distributed debiased Sparse Multi-task Lasso.
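Putting the steps of Algorithm 1 together, a minimal single-process sketch might look as follows. All names are illustrative, and the matrix M is approximated here by the pseudo-inverse of the empirical covariance rather than by the row-wise program of Javanmard and Montanari, so this is an assumption of the sketch, not the paper's exact construction.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dsml(tasks, alpha=0.1, threshold=0.5):
    """One-round DSML sketch: local lasso -> debiasing -> group hard
    thresholding at the master -> support-based filtering at the workers."""
    debiased = []
    for X, y in tasks:                         # worker side
        n = X.shape[0]
        beta = Lasso(alpha=alpha).fit(X, y).coef_
        Sigma = X.T @ X / n                    # empirical covariance
        M = np.linalg.pinv(Sigma)              # crude stand-in for M
        beta_d = beta + M @ X.T @ (y - X @ beta) / n   # one Newton step
        debiased.append(beta_d)
    B = np.column_stack(debiased)              # master side: p x m matrix
    row_norms = np.linalg.norm(B, axis=1) / np.sqrt(B.shape[1])
    S_hat = np.flatnonzero(row_norms >= threshold)   # group thresholding
    final = np.zeros_like(B)
    final[S_hat] = B[S_hat]                    # keep only coordinates in S_hat
    return final, S_hat
```

In an actual deployment the loop body runs on the workers and only the length-p debiased vector crosses the network, with the thresholded support sent back, which is the single communication round the algorithm requires.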

Recall that each worker node contains the data for one task, that is, node t contains (X_t, y_t). In the first step, each worker node solves the lasso problem locally, that is, minimizes the program in (2) and obtains β̂_t. Next, the worker node constructs a debiased lasso estimator by performing one Newton step on the loss function, starting at the estimated value β̂_t:

    β̂_t^d = β̂_t + (1/n) M_t X_t^T (y_t − X_t β̂_t),

where, by the KKT conditions, (1/n) X_t^T (y_t − X_t β̂_t) = λ ẑ_t for a subgradient ẑ_t of the ℓ1 norm at β̂_t, and the matrix M_t serves as an approximate inverse of the Hessian Σ̂_t = X_t^T X_t / n. The idea of debiasing the lasso estimator was introduced in the recent literature on statistical inference in high dimensions [Zhang and Zhang, 2013; van de Geer et al., 2014; Javanmard and Montanari, 2014]. By removing the bias introduced through the ℓ1 penalty, one can estimate the sampling distribution of a component of β̂_t^d and make inference about the unknown parameter of interest. In our paper, we will also utilize the sampling distribution of the debiased estimator, however with a different goal in mind. The above-mentioned papers proposed different techniques to construct the matrix M_t. Here, we adopt the approach proposed in [Javanmard and Montanari, 2014], as it leads to the weakest assumptions on the model in (16): each machine uses a matrix M_t whose i-th row solves

    minimize  m^T Σ̂_t m    subject to  ||Σ̂_t m − e_i||_∞ ≤ μ,

where e_i is the vector with i-th component equal to 1 and 0 otherwise, and μ is a tuning parameter.

After each worker obtains the debiased estimator β̂_t^d, it sends it to the central machine. After debiasing, the estimator is no longer sparse, so each worker communicates p numbers to the master node. It is at the master that the shared sparsity between the task coefficients gets utilized. The master node concatenates the received estimators into a p × m matrix B̂ = (β̂_1^d, …, β̂_m^d). Let B̂_j denote the j-th row of B̂. The master performs hard group thresholding to obtain an estimate of the support S as

    Ŝ = { j : ||B̂_j||_2 ≥ Λ },    (5)

for a threshold Λ. The estimated support Ŝ is communicated back to each worker, which then uses it to filter its local estimate. In particular, each worker produces the final estimate:

    β̃_{t,j} = β̂_{t,j}^d if j ∈ Ŝ, and β̃_{t,j} = 0 otherwise.    (6)
Extension to multitask classification.

DSML can be generalized to estimate multi-task generalized linear models. We briefly outline how to extend DSML to a multi-task logistic regression model, where

    P(y_{t,i} = 1 | x_{t,i}) = η(x_{t,i}^T β*_t),   with  η(u) = 1 / (1 + exp(−u)).

First, each worker solves an ℓ1-regularized logistic regression problem. Let W_t be a diagonal weighting matrix whose i-th diagonal element is

    η(x_{t,i}^T β̂_t) (1 − η(x_{t,i}^T β̂_t)),

which will be used to approximately invert the Hessian matrix of the logistic loss. The matrix M_t, which serves as an approximate inverse of the Hessian, can in the case of logistic regression be obtained as a solution to the optimization problem of the previous section with Σ̂_t replaced by the weighted covariance X_t^T W_t X_t / n. Finally, the debiased estimator is obtained as

    β̂_t^d = β̂_t + (1/n) M_t X_t^T (y_t − η(X_t β̂_t)),

and then communicated to the master node. The rest of the procedure is as described before.
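A sketch of the logistic debiasing step follows. The hypothetical helper below approximates M by the pseudo-inverse of the weighted Hessian rather than by the optimization program described above, and all names and parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def debiased_logistic(X, y, C=1.0):
    """l1-penalized logistic fit followed by one Newton-type correction."""
    n = X.shape[0]
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear",
                             fit_intercept=False)
    beta = clf.fit(X, y).coef_.ravel()
    eta = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted probabilities
    w = eta * (1.0 - eta)                   # diagonal of the weighting matrix
    H = (X * w[:, None]).T @ X / n          # weighted Hessian of the loss
    M = np.linalg.pinv(H)                   # stand-in for the program above
    return beta + M @ X.T @ (y - eta) / n   # one Newton step
```

The returned vector plays the role of the debiased estimator that each worker would send to the master, after which the group thresholding step is unchanged.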

5 Theoretical Analysis

In this section, we present our main theoretical results for the DSML procedure described in the previous section. We start by describing the assumptions that we make on the model in (16). Our results are based on a random-design analysis; we also discuss the fixed-design case in the appendix. Let the data for the t-th task be drawn from a subgaussian random vector with covariance matrix Σ_t. We assume the subgaussian random vectors for every task have bounded subgaussian norm [Vershynin, 2012]. We assume that the minimal and maximal eigenvalues of the Σ_t are bounded below and above, uniformly over the tasks, and that the maximal diagonal element of the inverse covariance matrices is bounded as well:

The following theorem is our main result; it is proved in the appendix. Suppose the regularization parameter λ in (2) was chosen appropriately, and suppose that the multi-task coefficients in (16) satisfy the following bound on the signal strength


Then the support estimated by the master node recovers the true support

with probability at least


Let us compare the minimal signal strength to that required by the lasso and the group lasso. Let B* denote the matrix of true coefficients. Simplifying (8), we have that our procedure requires the minimum signal strength to satisfy


where the notation hides constants and lower-order terms. For the centralized group lasso, the standard analysis assumes a stronger condition on the data, namely that the design matrix satisfies mutual incoherence and a sparse eigenvalue condition. Mutual incoherence is a much stronger condition on the design than the generalized coherence condition required by DSML. The group lasso recovers the support if [Corollary 5.2 of Lounici et al., 2011]:


where the constant depends on the mutual incoherence and sparse eigenvalue parameters. Under the irrepresentable condition on the design (which is weaker than mutual incoherence), the lasso requires the signal to satisfy [Bunea, 2008; Wainwright, 2009]:


for constants depending on the mutual coherence and sparse eigenvalue parameters. Ignoring for the moment the differences in the conditions on the design matrix, there are two advantages of the multitask group lasso over the local lasso: it relaxes the signal strength requirement to a requirement on the average strength across tasks, and it reduces the dominant term by a factor that grows with the number of tasks. Similarly to the group lasso, DSML requires a lower bound only on the average signal strength, not on any individual coefficient, and, as long as the number of tasks is not too large, it enjoys the same linear reduction in the dominant term of the required signal strength, matching the leading term of the group lasso bound.

Based on Theorem 5, we have the following corollary, which characterizes the estimation error and prediction risk of DSML; the proof is given in the appendix. Suppose the conditions of Theorem 5 hold. Then, with high probability, we have


Let us compare these guarantees to those of the group lasso. For DSML, Corollary 2 yields:


When using the group lasso, the restricted eigenvalue condition is sufficient for obtaining error bounds, and the following holds for the group lasso [Corollary 4.1 of Lounici et al., 2011]:


which is minimax optimal (up to a logarithmic factor). Albeit under the stronger generalized coherence condition, DSML matches this bound when the number of tasks is not too large. Similarly, for prediction DSML attains:


which in the same regime matches the group lasso minimax optimal rate:


In both cases, as long as the number of tasks is not too large, we have a linear improvement over the lasso, which corresponds to (13) and (15) with a single task.

6 Experimental results

Figure 1: Hamming distance, estimation error, and prediction error for multi-task regression. Top row: the number of tasks is fixed and the sample size per task is varied. Bottom row: the sample size is fixed and the number of tasks is varied.

Our first set of experiments is on simulated data. We generated synthetic data according to the model in (16) and in (7). Rows of the design matrices are sampled from a mean-zero multivariate normal distribution. The data dimension and the number of truly relevant variables are fixed across runs, the non-zero coefficients of the true coefficient matrix are generated uniformly at random, and the noise variance is set to 1. Our simulation results are averaged over 200 independent runs.

We investigate how the performance of the various procedures changes as a function of the problem parameters. We compare the following procedures: i) local lasso, ii) group lasso, iii) refitted group lasso, where a worker node performs ordinary least squares on the selected support, iv) iCAP, and v) DSML. The parameters for the local lasso, group lasso and iCAP were tuned to achieve the minimal Hamming error in variable selection. For DSML, to debias the output of the local lasso estimator, we use the construction described in Section 4; the thresholding parameter is also optimized to achieve the best variable selection performance. The simulation results for regression are shown in Figure 1. In terms of support recovery (measured by Hamming distance), group lasso, iCAP, and DSML all perform similarly and significantly better than the local lasso. In terms of estimation error, the lasso performs the worst, while DSML and the refitted group lasso perform the best; this might be a result of removing the bias introduced by regularization. Since the group lasso recovers the true support in most cases, refitting on it yields the maximum likelihood estimator on the true support. It is remarkable that DSML performs almost as well as this oracle estimator.
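The three quantities plotted in Figure 1 can be computed, for a true coefficient matrix and an estimate, roughly as follows. This is a sketch; the function and variable names are illustrative rather than the paper's.

```python
import numpy as np

def evaluation_metrics(B_true, B_hat, X_test, tol=1e-8):
    """Hamming distance on the joint support, Frobenius estimation
    error, and mean squared prediction error on a test design."""
    S_true = np.linalg.norm(B_true, axis=1) > tol    # true joint support
    S_hat = np.linalg.norm(B_hat, axis=1) > tol      # estimated joint support
    hamming = int(np.sum(S_true != S_hat))
    est_err = float(np.linalg.norm(B_true - B_hat))  # Frobenius norm
    pred_err = float(np.mean((X_test @ (B_true - B_hat)) ** 2))
    return hamming, est_err, pred_err
```

Support recovery is scored row-wise on the coefficient matrix, so a variable counts as one error whether it is wrongly included or wrongly excluded for the whole group of tasks.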

Figure 2 shows the simulation results for classification. As in the regression case, we make the following observations:

  • The group sparsity based approaches, including DSML, significantly outperform the individual lasso.

  • In terms of Hamming variable selection error, DSML performs slightly worse than the group lasso and iCAP, while in terms of estimation error and prediction error, DSML performs much better than the group lasso and iCAP. Given that the group lasso recovers the true support in most cases, the refitted group lasso is equivalent to the oracle maximum likelihood estimator. It is remarkable that DSML performs only slightly worse than the refitted group lasso.

  • The advantage of DSML, as well as of the group lasso, over the individual lasso becomes more significant as the number of tasks increases.

Figure 2: Hamming distance, estimation error, and prediction error for multi-task classification. Top row: the number of tasks is fixed and the sample size per task is varied. Bottom row: the sample size is fixed and the number of tasks is varied.

We also evaluated DSML on the following benchmark data sets considered in previous investigations of shared support multi-task learning:

  • School: This is a widely used dataset for multi-task learning [Argyriou et al., 2008]. The goal is to predict students’ performance at London secondary schools. There are 27 attributes for each student. The tasks are naturally divided according to the different schools. We only considered schools with at least 200 students, which results in 11 tasks.

  • Protein: The task is to predict protein secondary structure [Sander and Schneider, 1991]. We considered three binary classification tasks: coil vs helix, helix vs strand, and strand vs coil. The dataset consists of 24,387 instances in total, each with 357 features.

  • OCR: We consider an optical character recognition problem. Data were gathered by Rob Kassel at the MIT Spoken Language Systems Group (http://www.seas.upenn.edu/~taskar/ocr/). Following [Obozinski et al., 2010], we consider the following 9 binary classification tasks: c vs e, g vs y, g vs s, m vs n, a vs g, i vs j, a vs o, f vs t, h vs n. Each image is represented by binary pixels.

  • MNIST: This is a handwritten digit recognition dataset (http://yann.lecun.com/exdb/mnist/); the data consists of images that represent digits, each represented by 784 pixels. We considered the following 5 binary classification tasks: 2 vs 4, 0 vs 9, 3 vs 5, 1 vs 7, 6 vs 8.

  • USPS: This dataset consists of handwritten images from envelopes processed by the U.S. Postal Service. We considered the following 5 binary classification tasks: 2 vs 4, 0 vs 9, 3 vs 5, 1 vs 7, 6 vs 8. Each image is represented by 256 pixels.

  • Vehicle: We considered the vehicle classification problem in distributed sensor networks [Duarte and Hu, 2004], with the following 3 binary classification tasks: AAV vs DW, AAV vs noise, DW vs noise. There are 98,528 instances in total, each described by 50 acoustic and 50 seismic features.

Figure 3: Comparison on real world datasets.

In addition to the procedures used in the previous section, we also compare against the dirty model [Jalali et al., 2010], as well as a centralized approach that first debiases the group lasso and then performs group hard thresholding as in (5). Regularization and thresholding parameters were tuned on a held-out subset of the data. In Figure 3 we report results of training on increasing fractions of the total data set. The multi-task methods clearly perform better than the local lasso, with DSML achieving similar error to the centralized methods.

7 Discussion

We introduced and studied a shared-sparsity distributed multi-task learning problem. We presented a novel communication-efficient approach that requires only one round of communication and achieves provable guarantees competing, to leading order, with the centralized approach, up to a generous bound on the number of machines. Our analysis was based on Restricted Eigenvalue and Generalized Coherence conditions. Such conditions, or other similar conditions, are required for support recovery, but much weaker conditions are sufficient for obtaining low prediction error with the lasso or group lasso. An interesting open question is whether there exists a communication-efficient method for distributed multi-task learning that attains the sample complexity of the group lasso even without Restricted Eigenvalue and Generalized Coherence conditions, or whether beating the sample complexity of the lasso in a more general setting inherently requires large amounts of communication. Our methods certainly rely on these stronger conditions.

DSML can be easily extended to other types of structured sparsity, including sparse group lasso [Friedman et al., 2010], tree-guided group lasso [Kim and Xing, 2010] and the dirty model [Jalali et al., 2010]. Going beyond shared sparsity, shared subspace (i.e. low rank) and other matrix-factorization and feature-learning methods are also commonly and successfully used for multi-task learning, and it would be extremely interesting to understand distributed multi-task learning in these models.


  • Argyriou et al. [2008] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Mach. Learn., 73(3):243–272, 2008.
  • Balcan et al. [2012] Maria-Florina Balcan, Avrim Blum, Shai Fine, and Yishay Mansour. Distributed learning, communication complexity and privacy. In COLT, pages 26.1–26.22, 2012.
  • Bickel et al. [2009] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. Ann. Stat., 37(4):1705–1732, 2009.
  • Boyd et al. [2011] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1–122, January 2011.
  • Bunea [2008] F. Bunea. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. Electron. J. Stat., 2:1153–1194, 2008.
  • Caruana [1997] R. Caruana. Multitask learning. Mach. Learn., 28(1):41–75, 1997.
  • Cavalier et al. [2002] L. Cavalier, G. K. Golubev, D. Picard, and A. B. Tsybakov. Oracle inequalities for inverse problems. Ann. Statist., 30(3):843–874, 06 2002.
  • Chapelle et al. [2010] Olivier Chapelle, Pannagadatta K. Shivaswamy, Srinivas Vadrevu, Kilian Q. Weinberger, Ya Zhang, and Belle L. Tseng. Multi-task learning for boosting with application to web search ranking. In KDD, pages 1189–1198, 2010.
  • Dinuzzo et al. [2011] Francesco Dinuzzo, Gianluigi Pillonetto, and Giuseppe De Nicolao. Client-server multitask learning from distributed datasets. IEEE Transactions on Neural Networks, 22(2):290–303, 2011.
  • Duarte and Hu [2004] Marco F. Duarte and Yu Hen Hu. Vehicle classification in distributed sensor networks. J. Parallel Distrib. Comput., 64(7):826–838, July 2004. ISSN 0743-7315.
  • Friedman et al. [2010] Jerome H. Friedman, Trevor J. Hastie, and Robert J. Tibshirani. A note on the group lasso and a sparse group lasso. ArXiv e-prints, arXiv:1001.0736, 2010.
  • Jaggi et al. [2014] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. Communication-efficient distributed dual coordinate ascent. In Proc. of NIPS, pages 3068–3076, 2014.
  • Jalali et al. [2010] Ali Jalali, Pradeep D. Ravikumar, Sujay Sanghavi, and Chao Ruan. A dirty model for multi-task learning. In Proc. of NIPS, pages 964–972, 2010.
  • Javanmard and Montanari [2014] Adel Javanmard and Andrea Montanari. Confidence intervals and hypothesis testing for high-dimensional regression. J. Mach. Learn. Res., 15(Oct):2869–2909, 2014.
  • Kim and Xing [2010] Seyoung Kim and Eric P. Xing. Tree-guided group lasso for multi-task regression with structured sparsity. In Proc. of ICML, pages 543–550, 2010.
  • Lee et al. [2015] Jason D. Lee, Yuekai Sun, Qiang Liu, and Jonathan E. Taylor. Communication-efficient sparse regression: a one-shot approach. ArXiv e-prints, arXiv:1503.04337, 2015.
  • Lounici et al. [2011] K. Lounici, M. Pontil, Alexandre B. Tsybakov, and Sara A. van de Geer. Oracle inequalities and optimal inference under group sparsity. Ann. Stat., 39:2164–204, 2011.
  • Obozinski et al. [2011] G. Obozinski, Martin J. Wainwright, and Michael I. Jordan. Support union recovery in high-dimensional multivariate regression. Ann. Stat., 39(1):1–47, 2011.
  • Obozinski et al. [2010] Guillaume Obozinski, Ben Taskar, and Michael I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.
  • Ram et al. [2010] S. Ram, A. Nedić, and V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of optimization theory and applications, 147(3):516–545, 2010.
  • Rudelson and Zhou [2013] Mark Rudelson and Shuheng Zhou. Reconstruction from anisotropic random measurements. 2013.
  • Sander and Schneider [1991] Chris Sander and Reinhard Schneider. Database of homology-derived protein structures and the structural meaning of sequence alignment. Protein, 9:56–68, 1991.
  • Shamir et al. [2014] O. Shamir, N. Srebro, and T. Zhang. Communication efficient distributed optimization using an approximate newton-type method. In ICML, 2014.
  • Shamir and Srebro [2014] Ohad Shamir and Nathan Srebro. On distributed stochastic optimization and learning. In 52nd Annual Allerton Conference on Communication, Control and Computing, 2014.
  • Turlach et al. [2005] B. A. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 47(3):349–363, 2005.
  • van de Geer et al. [2014] Sara A. van de Geer, Peter Bühlmann, Ya’acov Ritov, and Ruben Dezeure. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Stat., 42(3):1166–1202, Jun 2014.
  • Vershynin [2012] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
  • Wainwright [2009] Martin J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE Trans. Inf. Theory, 55(5):2183–2202, 2009.
  • Weinberger et al. [2009] Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In ICML, pages 1113–1120, 2009.
  • Yuan and Lin [2006] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. B, 68:49–67, 2006.
  • Zhang and Zhang [2013] Cun-Hui Zhang and Stephanie S. Zhang. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. B, 76(1):217–242, Jul 2013.
  • Zhang et al. [2013a] Yuchen Zhang, John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Proc. of NIPS, pages 2328–2336, 2013a.
  • Zhang et al. [2013b] Yuchen Zhang, John C. Duchi, and Martin J. Wainwright. Communication-efficient algorithms for statistical optimization. J. Mach. Learn. Res., 14(1):3321–3363, 2013b.
  • Zhao et al. [2009] Peng Zhao, Guilherme Rocha, and Bin Yu. The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Statist, pages 3468–3497, 2009.
  • Zhou et al. [2013] Jiayu Zhou, Jun Liu, Vaibhav A. Narayan, and Jieping Ye. Modeling disease progression via multi-task learning. NeuroImage, 78:233–248, 2013.

8 Appendix

Appendix A Proof of Theorem 1

We first introduce the following lemma. Suppose the rows of each design matrix are independent subgaussian random vectors with mean zero and the corresponding covariance matrix. Let

Then with probability at least for some constant , we have


As shown in Theorem 2.4 of [Javanmard and Montanari, 2014], it will be a feasible solution for the problem of estimating the rows of the matrix. Since we are minimizing the objective, we must have

Based on concentration results for sub-exponential random variables [Vershynin, 2012] (see also Lemma 3.3 of [Lee et al., 2015]), we know that with high probability we have

Taking a union bound over the tasks, we obtain that with high probability,

Now we are ready to prove Theorem 1. Recall the model assumption


and the debiased estimator


we have

For the term , define

we have the following bound


Notice that

Our next step uses a result on the concentration of random variables. For any coordinate , we have

where the summands involve standard normal random variables. Using Lemma C with an appropriate weight vector and deviation level, we have

A union bound over all coordinates gives us that, with probability at least


Combining (18) and (19), we get the following estimation error bound:


where the first and last inequalities use elementary properties of the quantities involved, and the second inequality uses (18) and (19). For every variable outside the true support, we have

Plugging in the stated choices, we obtain

From (20) and the choice of the threshold, we see that all variables not in the true support will be excluded from the estimated support as well. For every variable in the true support, we have

Therefore, all variables in the true support correctly remain in the estimated support after the group hard thresholding.

Appendix B Proof of Corollary 

From Theorem 2 we have that and


with high probability. Summing over the tasks, we obtain the estimation error bound. For the prediction risk bound, we have

Using (21) and the fact that the coefficient matrix is row-wise sparse, we obtain the prediction risk bound.

Appendix C Collection of known results

For completeness, we first give the definition of the subgaussian norm; details can be found in [Vershynin, 2012].

[Subgaussian norm] The subgaussian norm of a subgaussian random vector is defined as

where the supremum is taken over the unit sphere.

We then define the restricted set as

The following proposition is a simple extension of Theorem 6.2 in [Bickel et al., 2009].


with some constant, be the regularization parameter in the lasso. Then, with high probability,

where the quantity below is the minimum restricted eigenvalue of the design matrix:


Using Theorem 6.2 in [Bickel et al., 2009] and taking a union bound over the tasks, we obtain the result. ∎

[Equation (27) in [Cavalier et al., 2002]; Lemma B.1 in [Lounici et al., 2011]] Let be i.i.d. standard normal random variables, let , and . We have, for all , that