One important problem in machine learning is to find the minimum of the expected loss,
is a loss function and has a distribution . In practice, the minimizer needs to be estimated by observing samples drawn from distribution . In many applications or are very large, so distributed algorithms are necessary in such cases. Without loss of generality, assume that and that the observations of the -th machine are . We consider the high-dimensional learning problem where the dimension can be very large, and the effective variables are supported on and . Extensive efforts have been made to develop batch algorithms [1, 2, 3], which provide good convergence guarantees in optimization. However, when is large, batch algorithms are inefficient, taking at least time per iteration. Therefore, there is growing recent interest in addressing this problem with distributed optimization frameworks [5, 6, 7], which are more efficient than stochastic algorithms. One important issue with existing distributed optimization methods for sparse learning is that they do not take advantage of the sparse structure, so they incur the same communication cost as general dense problems. In this paper, we propose a novel communication-efficient distributed algorithm that explicitly leverages the sparse structure to solve large-scale sparse learning problems. This allows us to reduce the communication cost from in existing works to , while still maintaining nearly the same performance under mild assumptions.
Notations For a sequence of numbers , we use to denote a sequence of numbers such that for some positive constant . Given two sequences of numbers and , we say if and if . The notation denotes that and
. For a vector, the -norm of is defined as , where ; the -norm of is defined as the number of its nonzero entries; the support of is defined as . For simplicity, we use to denote the set . For a matrix , we define the -norm of as . Given a number , the hard thresholding of a vector is defined by keeping the largest entries of (in magnitude) and setting the rest to be zero. Given a subset of index set , the projection of a vector on is defined by
is also denoted as for short.
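The two operators defined above, hard thresholding and projection onto an index set, can be sketched in a few lines of NumPy. This is a minimal illustration only; the function names `hard_threshold` and `project` are hypothetical and are not taken from the paper.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v and set the rest to zero."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    if k <= 0:
        return out
    if k >= v.size:
        return v.copy()
    # argpartition puts the indices of the k largest |v_i| in the last k slots
    keep = np.argpartition(np.abs(v), -k)[-k:]
    out[keep] = v[keep]
    return out

def project(v, support):
    """Zero out every entry of v whose index is not in `support`."""
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    idx = np.fromiter(support, dtype=int)
    out[idx] = v[idx]
    return out

v = np.array([0.5, -3.0, 0.1, 2.0, -0.2])
ht = hard_threshold(v, 2)    # keeps the entries -3.0 and 2.0
pr = project(v, {0, 3})      # keeps the entries at indices 0 and 3
print(ht, pr)
```

Ties in magnitude are broken arbitrarily by `argpartition`, which is harmless for the definition given here.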
I-A Related Work
There is much previous work on distributed optimization, such as (Zinkevich et al. ; Dekel et al. ; Zhang et al. ; Shamir and Srebro ; Arjevani and Shamir ; Lee et al. ; Zhang and Xiao ). Initially, most distributed algorithms used averaging estimators formed by local machines (Zinkevich et al. ; Zhang et al. ). Then Zhang and Xiao , Shamir et al.  and Lee et al.  proposed more communication-efficient distributed optimization algorithms. More recently, using ideas from approximate Newton-type methods, Jordan et al.  and Wang et al.  further improved the computational efficiency of this type of method.
Prior work on hard thresholding showed that, under suitable conditions, hard thresholding type first-order algorithms attain linear convergence to a solution that has optimal estimation accuracy with high probability. However, to the best of our knowledge, hard thresholding techniques applied to approximate Newton-type distributed algorithms have not been considered yet. In this paper, we therefore present some initial theoretical and experimental results on this topic.
In this section, we explain our approach to estimating the that minimizes the expected loss. The detailed steps are summarized in Algorithm 1.
First the empirical loss at each machine is defined as
At the beginning of the algorithm, we solve a local Lasso subproblem to get an initial point. Specifically, at iteration , the master machine solves the minimization problem
The initial point is formed by keeping the largest elements of the resulting minimizer and setting the other elements to zero, i.e., . Then, is broadcast to the local machines, where it is used to compute the gradient of the local empirical loss at , that is, . The local machines project on the support of and transmit the projection back to the master machine. Later, at the -th iteration (), the master solves a shifted regularized minimization subproblem:
Again the minimizer is truncated to form , and this quantity is communicated to the local machines, where it is used to compute the local gradient as before.
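The communication pattern described above can be sketched as follows for the squared loss. This is an illustrative sketch under stated assumptions, not the authors' implementation: the shifted subproblem is assumed to take the form L_1(w) + <g - grad L_1(w_t), w> + lam * ||w||_1 with g the average of the transmitted gradients, the subproblem is solved with plain proximal gradient (ISTA), the regularization weight `lam` is held fixed, and the master's own gradient is kept dense inside the average. All function names are hypothetical.

```python
import numpy as np

def hard_threshold(v, k):
    """Keep the k largest-magnitude entries of v, zero the rest (k >= 1)."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-k:]
    out[keep] = v[keep]
    return out

def local_grad(X, y, w):
    """Gradient of the local squared loss (1/2n)||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

def shifted_lasso(X, y, shift, lam, iters=500):
    """ISTA for (1/2n)||Xw - y||^2 + <shift, w> + lam * ||w||_1."""
    n, d = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the smooth part
    w = np.zeros(d)
    for _ in range(iters):
        z = w - step * (local_grad(X, y, w) + shift)
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft thresholding
    return w

def two_way_truncation(Xs, ys, k, lam, rounds=5):
    """Xs[0], ys[0] live on the master; the remaining pairs live on the workers."""
    d = Xs[0].shape[1]
    # Initial point: local Lasso on the master, truncated to k entries.
    w = hard_threshold(shifted_lasso(Xs[0], ys[0], np.zeros(d), lam), k)
    for _ in range(rounds):
        support = np.flatnonzero(w)
        grad_master = local_grad(Xs[0], ys[0], w)   # assumption: not truncated
        grads = [grad_master]
        for X, y in zip(Xs[1:], ys[1:]):
            g = np.zeros(d)
            # Worker side: only the entries on supp(w) are sent back.
            g[support] = local_grad(X, y, w)[support]
            grads.append(g)
        # Master side: shifted l1-regularized subproblem, then truncate again.
        shift = np.mean(grads, axis=0) - grad_master
        w = hard_threshold(shifted_lasso(Xs[0], ys[0], shift, lam), k)
    return w

# Toy run with hypothetical sizes and a hypothetical true parameter.
rng = np.random.default_rng(0)
m, n, d, s, k = 4, 200, 50, 5, 10
w_star = np.zeros(d)
w_star[:s] = 1.0
Xs = [rng.standard_normal((n, d)) for _ in range(m)]
ys = [X @ w_star + 0.5 * rng.standard_normal(n) for X in Xs]
w_hat = two_way_truncation(Xs, ys, k=k, lam=0.1)
print(np.linalg.norm(w_hat - w_star))
```

Each round moves one truncated vector from the master to the workers and one support-restricted gradient per worker back, which is the "two-way" aspect of the truncation.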
Solving subproblem is inspired by the approach of Wang et al.  and Jordan et al. . Note that the formulation takes advantage of both global first-order information and local higher-order information. Specifically, assuming and has an invertible Hessian, the solution of has the following closed form
which is similar to a Newton updating step. Note that here we add a projection procedure to reduce the number of nonzeros that need to be communicated to the master machine. This procedure is reasonable intuitively. First, when is close to , the elements of outside the support should be very small, so nominally little error is incurred in the truncation step. Second, when is also close to , the lost part has even more minimal effects on the inner product in subproblem . Third, we leave in out of the truncation to maintain the formulation as unbiased.
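The Newton-like behaviour can be checked numerically in a simple special case. The sketch below assumes a squared loss on the first machine, no l1 penalty, and a subproblem of the form L_1(w) + <g - grad L_1(w_t), w>, where g stands in for the aggregated gradient; under these assumptions the minimizer is w_t - H_1^{-1} g, with H_1 the Hessian of L_1. The variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
X1 = rng.standard_normal((n, d))
y1 = rng.standard_normal(n)
w_t = rng.standard_normal(d)
g = rng.standard_normal(d)        # stand-in for the aggregated gradient

H1 = X1.T @ X1 / n                # Hessian of L_1(w) = (1/2n)||X1 w - y1||^2

def grad1(w):
    """Gradient of L_1 at w."""
    return X1.T @ (X1 @ w - y1) / n

# Candidate closed form: a Newton-like step that uses the local Hessian.
w_newton = w_t - np.linalg.solve(H1, g)

# Stationarity condition of L_1(w) + <g - grad1(w_t), w> evaluated at w_newton.
residual = grad1(w_newton) + g - grad1(w_t)
print(np.max(np.abs(residual)))
```

The residual vanishes up to floating-point error because grad1(w_newton) - grad1(w_t) equals H1 (w_newton - w_t) for a quadratic loss.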
III Theoretical Analysis
III-A Main Theorem
We present some theoretical analysis of the proposed algorithm in this section. The loss is a -smooth function of the second argument, i.e.,
Moreover, the third derivative with respect to its second argument, , is bounded by a constant , i.e.,
The empirical loss function computed on the first machine satisfies that: , we have
where is defined as
The , and defined in Algorithm 1 satisfy the following condition: there exist positive constants and and such that for ,
In practice, both and are very small even after only one round of communication and will decrease to fast in the later steps. For simplicity, we define the following notation:
Then with probability at least , we have that
where and are positive constants independent of . The theorem immediately implies the following convergence result. Suppose that for all
Then under the assumption of Theorem III-A we have
where and are defined in Theorem III-A and independent of . From the conclusion, we know that the hard thresholding parameter can be chosen as , where can be a moderate constant larger than . By contrast, previous work such as  solving a nonconvex minimization problem subject to constraint requires that , where is the condition number of the objective function. Moreover, instead of only hard thresholding the solution of the Lasso subproblems, we also project the gradients in (II). Together, these reduce the communication cost from to .
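To make the communication saving concrete, the sketch below shows one way a worker could send only the gradient entries on the support of the broadcast iterate, with the master rebuilding the projected gradient. It assumes the master already knows the support (it broadcast the iterate), so indices need not be sent back. The sizes and function names are hypothetical.

```python
import numpy as np

def worker_message(grad, support):
    """What a worker sends: only the gradient entries on the shared support."""
    return grad[support]

def master_rebuild(values, support, d):
    """Rebuild the projected d-dimensional gradient on the master."""
    out = np.zeros(d)
    out[support] = values
    return out

rng = np.random.default_rng(0)
d, k = 10_000, 20
support = np.sort(rng.choice(d, size=k, replace=False))  # support of the broadcast iterate
grad = rng.standard_normal(d)                            # a worker's dense local gradient

msg = worker_message(grad, support)
rebuilt = master_rebuild(msg, support, d)
print(msg.size, "numbers sent instead of", grad.size)
```

In this toy setup the per-round message shrinks from the ambient dimension to the truncation level.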
III-B Sparse Linear Regression
In sparse linear regression, the data are generated according to the model
where the noise are i.i.d. subgaussian random variables with zero mean. Usually the loss function for this problem is the squared loss function, which is -smooth.
Combining Corollary III-A with some intermediate results obtained from [19, 20] and , we have the following bound for the estimation error. Suppose the design matrix and noise are subgaussian, Assumption III-A holds and is defined as (4). Then under the sparse linear model, we have the following estimation error bounds with probability at least :
where and are defined in Theorem III-A, and where
Under certain conditions we can further simplify the bound and gain insight into the relation between . When , it is easy to see by choosing
and there holds the following error bounds with high probability:
III-C Sparse Logistic Regression
Now we test our algorithm on both simulated data and real data. In both settings, we compare our algorithm with various advanced algorithms. These algorithms are:
EDSL: the state-of-the-art approach proposed by Jialei Wang et al. .
Centralize: using all data, one machine solves the centralized loss minimization problem with regularization. This procedure is communication-expensive or requires much larger storage.
Local: the first machine solves the local regularized loss minimization problem with only the data stored on this machine, ignoring all the other data.
Two-way Truncation: the proposed sparse learning approach which further improves the communication efficiency.
IV-A Simulated Data
The simulated data is sampled from a multivariate Gaussian distribution with zero mean and covariance matrix . We choose two different covariance matrices: for a well-conditioned situation and for an ill-conditioned situation. The noise in the sparse linear model () is set to be a standard Gaussian random variable. We set the true parameter to be -sparse, where all the entries are zero except that the first entries are i.i.d. random variables drawn from a uniform distribution on [0,1]. Under both models, we set the hard thresholding parameter greater than s but less than .
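A data generator matching this description for the sparse linear model might look like the sketch below. The covariance matrices used in the experiments are not reproduced here, so the Toeplitz form Sigma_ij = rho^|i-j| and the value of `rho` are placeholders, as are the sizes and the function name `simulate_linear`.

```python
import numpy as np

def simulate_linear(n, d, s, rho, rng):
    """Draw (X, y, w_star) from a sparse linear model with Gaussian design."""
    idx = np.arange(d)
    # Placeholder covariance: Toeplitz matrix with entries rho**|i - j|.
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    # s-sparse truth: the first s entries are i.i.d. Uniform[0, 1], the rest zero.
    w_star = np.zeros(d)
    w_star[:s] = rng.uniform(0.0, 1.0, size=s)
    # Standard Gaussian noise, as in the text.
    y = X @ w_star + rng.standard_normal(n)
    return X, y, w_star

rng = np.random.default_rng(0)
X, y, w_star = simulate_linear(n=200, d=30, s=5, rho=0.5, rng=rng)
print(X.shape, y.shape, np.count_nonzero(w_star))
```

Increasing the correlation parameter in the placeholder covariance worsens the conditioning of the design, which is one simple way to mimic the well-conditioned versus ill-conditioned comparison.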
Here we compare the algorithms in different settings of and plot the estimation error over rounds of communication. The results for sparse linear regression and sparse logistic regression are shown in Figure 1 and Figure 2. We can observe from these plots that:
First, there is indeed a large gap between the local estimation error and the centralized estimation error. The estimation errors of EDSL and the Two-way Truncation decrease to the centralized one in the first several rounds of communications.
Second, the Two-way Truncation algorithm is competitive with EDSL in both statistical accuracy and convergence rate, as the theory indicates. Since it converges at least as fast as EDSL and requires less communication and computation in each iteration, it is overall more efficient in both communication and computation.
The above results support the theory that the Two-way Truncation approach is indeed more efficient while remaining competitive with the centralized approach and EDSL.
IV-B Real Data
In this section, we examine the above sparse learning algorithms on real-world datasets. The data comes from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/) and the LIBSVM website (https://www.csie.ntu.edu.tw/cjlin/libsvmtools/datasets/). The high-dimensional datasets ’dna’ and ’a9a’ are used in the regression model and classification model, respectively. We randomly partition the data in for training, validation and testing, respectively. Here the data is divided randomly on machines and processed by the algorithms mentioned above. The results are summarized in Figure 3. These results on real-world data again support the theoretical analysis: the proposed Two-way Truncation approach is an effective sparse learning method with very small communication and computation costs.
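The random partitioning step can be sketched as below. The split proportions and the number of machines are not recoverable from the text, so the values (0.6, 0.2, 0.2) and m = 4 are purely hypothetical, as is the helper name `split_and_shard`.

```python
import numpy as np

def split_and_shard(n_samples, fractions, m, rng):
    """Randomly split indices into train/val/test, then shard train across m machines."""
    perm = rng.permutation(n_samples)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train, val, test = np.split(perm, [n_train, n_train + n_val])
    shards = np.array_split(train, m)   # near-equal shards, one per machine
    return shards, val, test

rng = np.random.default_rng(0)
shards, val, test = split_and_shard(1000, (0.6, 0.2, 0.2), m=4, rng=rng)
print([len(sh) for sh in shards], len(val), len(test))
```

Working with index arrays rather than copies of the data keeps the same helper usable for both the regression and the classification datasets.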
In this paper we propose a novel distributed sparse learning algorithm with Two-way Truncation. Theoretically, we prove that the algorithm gives an estimate that converges exponentially to the minimizer of the expected loss and attains nearly the same statistical accuracy as EDSL and the centralized method. Due to the truncation procedure, this algorithm is more efficient in both communication and computation. Extensive experiments on both simulated data and real data support this claim.
The authors gratefully acknowledge support from NSF Award CCF-1217751 and DARPA Young Faculty Award N66001-14-1-4047, and thank Jialei Wang for very useful suggestions.
-  Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani, et al., “Pathwise coordinate optimization,” The Annals of Applied Statistics, vol. 1, no. 2, pp. 302–332, 2007.
-  A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sciences, vol. 2, no. 1, pp. 183–202, 2009.
-  Lin Xiao and Tong Zhang, “A proximal-gradient homotopy method for the sparse least-squares problem,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 1062–1091, 2013.
-  Peter Bühlmann and Sara Van De Geer, Statistics for high-dimensional data: methods, theory and applications, Springer Science & Business Media, 2011.
-  Michael I Jordan, Jason D Lee, and Yun Yang, “Communication-efficient distributed statistical learning,” stat, vol. 1050, pp. 25, 2016.
-  Jason D Lee, Qihang Lin, Tengyu Ma, and Tianbao Yang, “Distributed stochastic variance reduced gradient methods and a lower bound for communication complexity,” arXiv preprint arXiv:1507.07595, 2015.
-  Jialei Wang, Mladen Kolar, Nathan Srebro, and Tong Zhang, “Efficient distributed learning with sparsity,” arXiv preprint arXiv:1605.07991, 2016.
-  Martin Zinkevich, Markus Weimer, Lihong Li, and Alex J Smola, “Parallelized stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2010, pp. 2595–2603.
-  Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao, “Optimal distributed online prediction using mini-batches,” Journal of Machine Learning Research, vol. 13, no. Jan, pp. 165–202, 2012.
-  Yuchen Zhang, Martin J Wainwright, and John C Duchi, “Communication-efficient algorithms for statistical optimization,” in Advances in Neural Information Processing Systems, 2012, pp. 1502–1510.
-  Ohad Shamir and Nathan Srebro, “Distributed stochastic optimization and learning,” in Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on. IEEE, 2014, pp. 850–857.
-  Yossi Arjevani and Ohad Shamir, “Communication complexity of distributed convex learning and optimization,” in Advances in Neural Information Processing Systems, 2015, pp. 1756–1764.
-  Yuchen Zhang and Lin Xiao, “DiSCO: Distributed optimization for self-concordant empirical loss,” in ICML, 2015, pp. 362–370.
-  Ohad Shamir, Nathan Srebro, and Tong Zhang, “Communication-efficient distributed optimization using an approximate Newton-type method,” in Proceedings of the International Conference on Machine Learning, 2014, vol. 32, pp. 1000–1008.
-  Jason D Lee, Yuekai Sun, Qiang Liu, and Jonathan E Taylor, “Communication-efficient sparse regression: a one-shot approach,” arXiv preprint arXiv:1503.04337, 2015.
-  Xiaotong Yuan, Ping Li, and Tong Zhang, “Gradient hard thresholding pursuit for sparsity-constrained optimization,” in Proceedings of the International Conference on Machine Learning, 2014, pp. 127–135.
-  Xingguo Li, Tuo Zhao, Raman Arora, Han Liu, and Jarvis Haupt, “Stochastic variance reduced optimization for nonconvex sparse learning,” in Proceedings of the International Conference on Machine Learning, 2016, pp. 917–925.
-  Prateek Jain, Ambuj Tewari, and Purushottam Kar, “On iterative hard thresholding methods for high-dimensional m-estimation,” in Advances in Neural Information Processing Systems, 2014, pp. 685–693.
-  Mark Rudelson and Shuheng Zhou, “Reconstruction from anisotropic random measurements,” Ann Arbor, vol. 1001, pp. 48109, 2011.
-  Roman Vershynin, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
-  Martin J Wainwright, “Sharp thresholds for high-dimensional and noisy recovery of sparsity using l1-constrained quadratic programming,” IEEE Transactions on Information Theory, 2009.
-  Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu, “A unified framework for high dimensional analysis of m-estimators with decomposable regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.