I. Introduction
One important problem in machine learning is to find the minimizer of the expected loss,

(1)  w* = argmin_w F(w),  F(w) = E_{z∼D}[f(w; z)].

Here f(w; z) is a loss function and the sample z has distribution D. In practice, the minimizer w* must be estimated from samples drawn from D. In many applications the sample size or the dimension is very large, so distributed algorithms are necessary in such cases. Without loss of generality, assume that the N samples are divided evenly over m machines and that the observations of the i-th machine are z_{i,1}, …, z_{i,n}. We consider the high-dimensional learning problem where the dimension d can be very large and the effective variables are supported on a set S* = supp(w*) of size s* ≪ d. Extensive efforts have been made to develop batch algorithms [1, 2, 3], which provide good convergence guarantees in optimization. However, when the data size is large, batch algorithms are inefficient, taking at least O(Nd) time per iteration. Therefore, there has been emerging recent interest in addressing this problem with distributed optimization frameworks [5, 6, 7], which are more efficient than stochastic algorithms. One important issue with existing distributed optimization methods for sparse learning is that they do not take advantage of the sparse structure, so they have the same communication cost as general dense problems. In this paper, we propose a novel communication-efficient distributed algorithm that explicitly leverages the sparse structure to solve large-scale sparse learning problems. This allows us to reduce the communication cost from O(d) in existing work to O(s), while still maintaining nearly the same performance under mild assumptions.

Notations. For a sequence of numbers a_n, we use O(a_n) to denote a sequence b_n such that b_n ≤ C·a_n for some positive constant C. Given two sequences a_n and b_n, we say a_n ≲ b_n if a_n = O(b_n), and a_n ≳ b_n if b_n = O(a_n). The notation a_n ≍ b_n denotes that a_n ≲ b_n and b_n ≲ a_n. For a vector v ∈ R^d, the ℓ_q norm of v is defined as ‖v‖_q = (Σ_j |v_j|^q)^{1/q}, where q ≥ 1; the ℓ_0 "norm" of v is defined as the number of its nonzero entries; the support of v is defined as supp(v) = {j : v_j ≠ 0}. For simplicity, we use [d] to denote the set {1, 2, …, d}. For a matrix A, we define the ℓ_∞ norm of A as ‖A‖_∞ = max_{i,j} |A_{ij}|. Given a number s, the hard thresholding H_s(v) of a vector v is defined by keeping the s largest entries of v (in magnitude) and setting the rest to zero. Given a subset S of the index set [d], the projection P_S(v) of a vector v on S is defined by keeping the entries indexed by S and setting the rest to zero; it is also denoted v_S for short.
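As a concrete illustration of the two operators just defined, here is a minimal NumPy sketch (the function names are ours, chosen for readability, not from the paper):

```python
import numpy as np

def hard_threshold(v, s):
    """H_s(v): keep the s largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]   # indices of the s largest |v_j|
    out[keep] = v[keep]
    return out

def project(v, S):
    """P_S(v), also written v_S: zero out all entries outside the index set S."""
    idx = list(S)
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

v = np.array([3.0, -0.5, 1.0, -4.0, 0.2])
print(hard_threshold(v, 2))   # [ 3.  0.  0. -4.  0.]
print(project(v, {0, 2}))     # [3. 0. 1. 0. 0.]
```

Note that H_s(v) is itself a projection of v onto the set found by a simple sort, which is why it costs only O(d log d) time locally.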
I-A. Related Work
There is much previous work on distributed optimization, such as Zinkevich et al. [8], Dekel et al. [9], Zhang et al. [10], Shamir and Srebro [11], Arjevani and Shamir [12], Lee et al. [6], and Zhang and Xiao [13]. Initially, most distributed algorithms used averaging estimators formed by local machines (Zinkevich et al. [8]; Zhang et al. [10]). Later, Zhang and Xiao [13], Shamir et al. [14], and Lee et al. [15] proposed more communication-efficient distributed optimization algorithms. More recently, using ideas from the approximate Newton-type method, Jordan et al. [5] and Wang et al. [7] further improved the computational efficiency of this type of method.
Many gradient hard thresholding approaches have been proposed in recent years, such as Yuan et al. [16], Li et al. [17], and Jain et al. [18]. They showed that, under suitable conditions, hard-thresholding-type first-order algorithms attain linear convergence to a solution with optimal estimation accuracy with high probability. However, to the best of our knowledge, hard thresholding techniques have not yet been applied to approximate Newton-type distributed algorithms. In this paper, we present initial theoretical and experimental results on this topic.
II. Algorithm
In this section, we explain our approach to estimating the parameter w* that minimizes the expected loss. The detailed steps are summarized in Algorithm 1.
First, the empirical loss at the i-th machine is defined as

L_i(w) = (1/n) Σ_{j=1}^{n} f(w; z_{i,j}).

At the beginning of the algorithm, we solve a local Lasso subproblem to get an initial point. Specifically, at iteration t = 0, the master machine solves the minimization problem

(2)  w^{0.5} = argmin_w L_1(w) + λ_0 ‖w‖_1.

The initial point w^0 is formed by keeping the s largest elements of the resulting minimizer and setting the other elements to zero, i.e., w^0 = H_s(w^{0.5}). Then, w^0 is broadcast to the local machines, where it is used to compute the gradient of the local empirical loss at w^0, that is, ∇L_i(w^0). The local machines project ∇L_i(w^0) on the support of w^0 and transmit the projection back to the master machine. At the t-th iteration (t ≥ 1), the master solves a shifted ℓ1-regularized minimization subproblem:

(3)  w^{t+0.5} = argmin_w L_1(w) − ⟨∇L_1(w^{t−1}) − (1/m) Σ_{i=1}^{m} [∇L_i(w^{t−1})]_{S_{t−1}}, w⟩ + λ_t ‖w‖_1,

where S_{t−1} = supp(w^{t−1}). Again the minimizer is truncated to form w^t = H_s(w^{t+0.5}), and this quantity is communicated to the local machines, where it is used to compute the local gradients as before.
Solving subproblem (3) is inspired by the approach of Wang et al. [7] and Jordan et al. [5]. Note that the formulation takes advantage of both global first-order information and local higher-order information. Specifically, assuming the regularization is dropped, the loss is quadratic, and L_1 has an invertible Hessian, the solution of (3) has the following closed form (ignoring the gradient truncation):

w^{t+0.5} = w^{t−1} − ∇²L_1(w^{t−1})^{−1} ∇L(w^{t−1}),

which is similar to a Newton update step. Note that here we add a projection procedure to reduce the number of nonzeros that need to be communicated to the master machine. This procedure is intuitively reasonable. First, when w^{t−1} is close to w*, the elements of the local gradient outside the support should be very small, so little error is incurred in the truncation step. Second, when the minimizer is also close to w*, the truncated part has an even smaller effect on the inner product in subproblem (3). Third, we leave ∇L_1 in (3) out of the truncation to keep the formulation unbiased.
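To make the update concrete, here is a minimal single-process simulation of the scheme for the squared loss, using the λ = 0 closed form above in place of the Lasso subproblems. All names, sizes, and the truncated least-squares initialization are our illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: m machines, n samples each, dimension d, sparsity s_star.
m, n, d, s_star = 4, 1000, 30, 5
w_star = np.zeros(d)
w_star[:s_star] = rng.uniform(0.5, 1.0, s_star)

X = [rng.standard_normal((n, d)) for _ in range(m)]
y = [Xi @ w_star + 0.1 * rng.standard_normal(n) for Xi in X]

def hard_threshold(v, s):
    """H_s(v): keep the s largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def local_grad(i, w, support):
    """Machine i's squared-loss gradient, projected on `support` before sending."""
    g = X[i].T @ (X[i] @ w - y[i]) / n
    p = np.zeros_like(g)
    p[support] = g[support]
    return p

s = 2 * s_star                       # thresholding level, a bit above s_star
# Init: truncated local least squares on the master (a stand-in for the Lasso step).
w = hard_threshold(np.linalg.lstsq(X[0], y[0], rcond=None)[0], s)
H1 = X[0].T @ X[0] / n               # Hessian of the master's local empirical loss

for _ in range(10):
    support = np.flatnonzero(w)      # locals only send gradient entries on supp(w^t)
    g = sum(local_grad(i, w, support) for i in range(m)) / m
    # Approximate-Newton step: master's curvature, averaged truncated gradients.
    w = hard_threshold(w - np.linalg.solve(H1, g), s)

print(np.linalg.norm(w - w_star))    # small: near the centralized error level
```

Under this quadratic sketch, each round communicates only the s-sparse iterate and the s gradient values per machine, rather than dense length-d vectors.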
III. Theoretical Analysis
III-A. Main Theorem
We present some theoretical analysis of the proposed algorithm in this section. It relies on the following assumptions.

Assumption 1. The loss f is a smooth function of its second argument. Moreover, its third derivative with respect to the second argument is bounded by a constant M.
Assumption 2. The empirical loss L_1 computed on the first machine satisfies a restricted strong convexity condition over sparse vectors.
Assumption 3. The regularization parameters λ_0 and λ_t and the thresholding level s defined in Algorithm 1 satisfy a compatibility condition: there exist positive constants c_1 and c_2 such that, for every t, the regularization dominates the corresponding statistical error terms. In practice, these error terms are very small even after only one round of communication and decrease to zero quickly in the later iterations; for simplicity, we use a shorthand for these quantities in what follows.
Now we state our main theorem.

Theorem 1. Suppose that Assumptions 1, 2, and 3 hold, and let the regularization parameter λ_t be chosen as in
(4)
Then, with high probability, the estimation error of the iterate w^t contracts geometrically in t up to a statistical error term, with constants that are positive and independent of t.

The theorem immediately implies the following convergence result.

Corollary 1. Suppose that for all t the condition
(5)
holds. Then, under the assumptions of Theorem 1, the iterates converge linearly and the final estimation error matches that of the centralized estimator up to constants, where the constants are those defined in Theorem 1 and are independent of t.

From this conclusion, we know that the hard thresholding parameter s can be chosen as s = c·s*, where c is a moderate constant larger than 1. By contrast, previous work such as [17], which solves a non-convex minimization problem subject to an ℓ_0 constraint, requires the thresholding parameter to grow with the condition number of the objective function. Moreover, instead of only hard thresholding the solutions of the Lasso subproblems, we also project the gradients in the algorithm of Section II. Together these reduce the communication cost from O(d) to O(s).
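The communication saving is easy to see in code: once the master has broadcast the s-sparse iterate, its support is known on both sides, so each machine only needs to send the gradient values on that support rather than a dense length-d vector. A small sketch with illustrative sizes:

```python
import numpy as np

d, s = 100_000, 50
rng = np.random.default_rng(1)

grad = rng.standard_normal(d)                    # a machine's local gradient
support = rng.choice(d, size=s, replace=False)   # supp(w^t), shared with master

dense_msg = grad                # O(d) floats per round (dense methods)
truncated_msg = grad[support]   # O(s) floats per round (projected gradient)

print(dense_msg.size, truncated_msg.size)   # 100000 50
```

With d = 100,000 and s = 50, each machine's per-round upload shrinks by a factor of 2000 while the master can still place each received value at the right coordinate.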
III-B. Sparse Linear Regression
In sparse linear regression, data (x, y) are generated according to the model

(6)  y = x^T w* + ε,

where the noise variables ε are i.i.d. sub-Gaussian random variables with zero mean. Usually the loss function for this problem is the squared loss (y − x^T w)², which is smooth. Combining Corollary 1 with some intermediate results obtained from [19, 20] and [21], we have the following bound for the estimation error.

Corollary 2. Suppose the design matrix and the noise are sub-Gaussian, Assumption 2 holds, and λ_t is defined as in (4). Then, under the sparse linear model, the estimation error bounds hold with high probability, with the constants defined in Theorem 1.
Under certain conditions we can further simplify the bound and gain insight into the relation between the problem parameters. When the local sample size is large enough, it is easy to see that by choosing λ_t appropriately, the estimation error is, with high probability, of the order √(s* log d / N), matching the rate of the centralized estimator.
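As a sanity check on how such error bounds scale with the total sample size N, the following sketch uses a truncated least-squares estimator as a simple stand-in for the sparse estimator (this is our illustration, not the paper's algorithm; sizes and the noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, s_star = 200, 5
w_star = np.zeros(d)
w_star[:s_star] = 1.0

def thresholded_lstsq_error(N):
    """Estimation error of least squares followed by hard thresholding."""
    X = rng.standard_normal((N, d))
    y = X @ w_star + 0.5 * rng.standard_normal(N)
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    out = np.zeros(d)
    keep = np.argsort(np.abs(w))[-s_star:]   # keep the s* largest entries
    out[keep] = w[keep]
    return np.linalg.norm(out - w_star)

errs = [thresholded_lstsq_error(N) for N in (400, 1600, 6400)]
print(errs)   # decreasing, roughly halving each time N quadruples
```

The roughly 1/√N decay visible here is the same scaling the corollary predicts for the distributed estimator.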
III-C. Sparse Logistic Regression
Combining Corollary 1 with some intermediate results obtained from [7] and [22], we can give a similar estimation error bound for sparse logistic regression. The explicit form is omitted due to space limitations.
IV. Experiments
Now we test our algorithm on both simulated and real data. In both settings, we compare our algorithm with the following methods:

EDSL: the state-of-the-art approach proposed by Wang et al. [7].

Centralize: one machine solves the centralized loss minimization problem with ℓ1 regularization using all of the data. This procedure is communication-expensive or requires much larger storage.

Local: the first machine solves the local ℓ1-regularized loss minimization problem using only the data stored on that machine, ignoring all the other data.

Two-way Truncation: the proposed sparse learning approach, which further improves communication efficiency.
IV-A. Simulated Data
The simulated data x are sampled from a multivariate Gaussian distribution with zero mean and covariance matrix Σ. We choose two different covariance matrices: one for a well-conditioned situation and one for an ill-conditioned situation. The noise ε in the sparse linear model (6) is set to be a standard Gaussian random variable. We set the true parameter w* to be sparse: all entries are zero except that the first s* entries are i.i.d. random variables drawn from the uniform distribution on [0, 1]. Under both models, we set the hard thresholding parameter s to be greater than s* but of the same order. Here we compare the algorithms under different settings of m, n, and d and plot the estimation error over rounds of communication. The results of sparse linear regression and sparse logistic regression are shown in Figure 1 and Figure 2, respectively. We observe from these plots that:

First, there is indeed a large gap between the local estimation error and the centralized estimation error. The estimation errors of EDSL and Two-way Truncation decrease to the centralized level within the first several rounds of communication.

Second, the Two-way Truncation algorithm is competitive with EDSL in both statistical accuracy and convergence rate, as the theory indicates. Since it converges at least as fast as EDSL and requires less communication and computation per iteration, overall it is more communication- and computation-efficient.
The above results support the theory that the Two-way Truncation approach is indeed more efficient, and competitive with the centralized approach and EDSL.
IV-B. Real Data
In this section, we examine the above sparse learning algorithms on real-world datasets. The data come from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The high-dimensional datasets 'dna' and 'a9a' are used for the regression model and the classification model, respectively. We randomly partition each dataset into training, validation, and testing sets; the training data are divided randomly over m machines and processed by the algorithms mentioned above. The results are summarized in Figure 3. These real-world experiments again validate the theoretical analysis: the proposed Two-way Truncation approach is an effective sparse learning method with very small communication and computation costs.

V. Conclusions
In this paper, we propose a novel distributed sparse learning algorithm with two-way truncation. Theoretically, we prove that the algorithm gives an estimate that converges to the minimizer of the expected loss exponentially fast and attains nearly the same statistical accuracy as EDSL and the centralized method. Due to the truncation procedure, the algorithm is more efficient in both communication and computation. Extensive experiments on both simulated and real data verify these claims.
Acknowledgment
The authors gratefully acknowledge support from NSF Award CCF-1217751 and DARPA Young Faculty Award N66001-14-1-4047, and thank Jialei Wang for very useful suggestions.
References
 [1] Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani, et al., “Pathwise coordinate optimization,” The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.
 [2] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
 [3] Lin Xiao and Tong Zhang, “A proximal-gradient homotopy method for the sparse least-squares problem,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1062–1091, 2013.
 [4] Peter Bühlmann and Sara Van De Geer, Statistics for high-dimensional data: methods, theory and applications, Springer Science & Business Media, 2011.
 [5] Michael I Jordan, Jason D Lee, and Yun Yang, “Communication-efficient distributed statistical learning,” stat, vol. 1050, pp. 25, 2016.
 [6] Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang, “Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity,” arXiv preprint arXiv:1507.07595, 2015.
 [7] Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang, “Efficient distributed learning with sparsity,” arXiv preprint arXiv:1605.07991, 2016.
 [8] Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola, “Parallelized stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2010, pp. 2595–2603.
 [9] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao, “Optimal distributed online prediction using mini-batches,” Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
 [10] Yuchen Zhang, Martin J Wainwright, and John C Duchi, “Communication-efficient algorithms for statistical optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 1502–1510.
 [11] Ohad Shamir and Nathan Srebro, “Distributed stochastic optimization and learning,” in Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on. IEEE, 2014, pp. 850–857.
 [12] Yossi Arjevani and Ohad Shamir, “Communication complexity of distributed convex learning and optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 1756–1764.
 [13] Yuchen Zhang and Lin Xiao, “DiSCO: Distributed optimization for self-concordant empirical loss,” in ICML, 2015, pp. 362–370.
 [14] Ohad Shamir, Nathan Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method,” in Proceedings of the International Conference on Machine Learning, 2014, vol. 32, pp. 1000–1008.
 [15] Jason D Lee, Yuekai Sun, Qiang Liu, and Jonathan E Taylor, “Communication-efficient sparse regression: a one-shot approach,” arXiv preprint arXiv:1503.04337, 2015.
 [16] Xiaotong Yuan, Ping Li, and Tong Zhang, “Gradient hard thresholding pursuit for sparsity-constrained optimization,” in Proceedings of the International Conference on Machine Learning, 2014, pp. 127–135.
 [17] Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt, “Stochastic variance reduced optimization for nonconvex sparse learning,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 917–925.
 [18] Prateek Jain, Ambuj Tewari, and Purushottam Kar, “On iterative hard thresholding methods for high-dimensional M-estimation,” in Advances in Neural Information Processing Systems, 2014, pp. 685–693.
 [19] Mark Rudelson and Shuheng Zhou, “Reconstruction from anisotropic random measurements,” 2011.
 [20] Roman Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
 [21] Martin J Wainwright, “Sharp thresholds for high-dimensional and noisy recovery of sparsity using ℓ1-constrained quadratic programming,” IEEE Transactions on Information Theory, 2009.
 [22] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu, “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.