I. Introduction
We consider a synchronous system that comprises a server and $m$ machines, as shown in Fig. 1. Each machine $i \in \{1, \ldots, m\}$ holds a pair $(A^i, b^i)$, where $A^i$ is a matrix of size $n_i \times d$ and $b^i$ is a column vector of size $n_i$. Let $\mathbb{R}^d$ denote the set of real-valued vectors of size $d$. For a vector $v \in \mathbb{R}^d$, let $\|v\|$ denote its 2-norm. If $v^T$ denotes the transpose of $v$, then $\|v\| = \sqrt{v^T v}$. The objective of the server is to solve the least-squares problem:

$$\min_{x \in \mathbb{R}^d} \; \sum_{i=1}^m \left\| A^i x - b^i \right\|^2. \qquad (1)$$
As the server does not know the values of the $A^i$'s and $b^i$'s, it must collaborate with the machines to solve the problem.
The common approaches for solving such distributed optimization problems include distributed gradient descent (DGD) [nedic2009distributed, yuan2016convergence], distributed alternating direction method of multipliers (DADMM) [shi2014linear, zhang2014asynchronous] and distributed dual averaging method (DDA) [duchi2011dual, tsianos2012push].
Accelerated gradient methods, such as those in [azizan2019distributed, nesterov27method], can also solve (1).
In this paper, we consider the distributed gradient descent (DGD) method, and propose a distributed iterative preconditioning technique for improving its convergence speed.
The DGD method is an iterative algorithm in which the server maintains an estimate of a point of minimum of (1), denoted by $x^*$, and updates it iteratively by collaborating with the machines as follows. For each iteration $t = 0, 1, \ldots$, let $x(t)$ denote the estimate of $x^*$ at the beginning of iteration $t$. The initial value $x(0)$ is chosen arbitrarily. In each iteration $t$, the server broadcasts $x(t)$ to all the machines. Each machine $i$ computes the gradient of the function $F^i(x) = \|A^i x - b^i\|^2$ at $x(t)$, denoted by $g^i(t)$, and sends it to the server. Note that

$$g^i(t) = 2 \, (A^i)^T \left( A^i x(t) - b^i \right). \qquad (2)$$

Upon receiving the gradients $g^i(t)$ from all the machines, the server updates $x(t)$ to $x(t+1)$ using a step-size of constant value $\delta > 0$ as follows:

$$x(t+1) = x(t) - \delta \sum_{i=1}^m g^i(t). \qquad (3)$$
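As an illustration of the server–machine loop above, the following self-contained Python sketch runs the DGD update (2)–(3) on small synthetic data. All variable names and parameter choices here (e.g. the step-size rule) are illustrative, not the paper's.

```python
import numpy as np

# Synthetic problem: m machines, each holding a block (A^i, b^i).
rng = np.random.default_rng(0)
m, d = 3, 4
A = [rng.standard_normal((6, d)) for _ in range(m)]
x_true = rng.standard_normal(d)
b = [Ai @ x_true for Ai in A]          # consistent system, so x* = x_true

def local_gradient(Ai, bi, x):
    """Machine i's gradient of ||A^i x - b^i||^2 at x, as in (2)."""
    return 2.0 * Ai.T @ (Ai @ x - bi)

# A small constant step-size based on the largest eigenvalue of A^T A.
H = sum(Ai.T @ Ai for Ai in A)
delta = 1.0 / (2.0 * np.linalg.eigvalsh(H).max())

x = np.zeros(d)                        # arbitrary initial estimate x(0)
for t in range(500):
    # Machines send local gradients; the server sums them and applies (3).
    x = x - delta * sum(local_gradient(A[i], b[i], x) for i in range(m))

print(np.linalg.norm(x - x_true))      # small: x(t) approaches x*
```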
To be able to present the contribution of this paper, we first briefly review the convergence of the DGD method described above.
I-A. Convergence of DGD
Let $A$ denote the matrix obtained by stacking the matrices $A^i$'s vertically. So, matrix $A$ is of size $n \times d$, where $n = \sum_{i=1}^m n_i$. If matrix $A$ is of full column rank, then we know that there exists $\delta$ for which there is a positive value $\rho < 1$ such that, for all $t$ [fessler2008image],

$$\| x(t+1) - x^* \| \le \rho \, \| x(t) - x^* \|.$$

The value $\rho$ is commonly referred to as the convergence rate. A smaller $\rho$ implies higher convergence speed, and vice versa. If we let $\lambda$ and $\gamma$ denote the largest and smallest eigenvalues of $A^T A$, respectively, then the smallest achievable convergence rate is known to be [fessler2008image]

$$\rho^* = \frac{\lambda - \gamma}{\lambda + \gamma}. \qquad (4)$$

The ratio $\lambda/\gamma$ is also commonly referred to as the condition number of matrix $A^T A$, which we denote by $\kappa$ [fessler2008image].
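The rate bound (4) can be checked numerically: for gradient descent on $\|Ax - b\|^2$ with the optimal constant step-size, every per-iteration contraction factor of the error stays below $(\lambda - \gamma)/(\lambda + \gamma)$. A minimal sketch on synthetic data (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))
x_true = rng.standard_normal(5)
b = A @ x_true

H = A.T @ A
eig = np.linalg.eigvalsh(H)
gamma, lam = eig[0], eig[-1]
kappa = lam / gamma                        # condition number of A^T A
rho_star = (lam - gamma) / (lam + gamma)   # = (kappa - 1) / (kappa + 1)

# Optimal constant step for the update x <- x - delta * 2 H (x - x*).
delta = 1.0 / (lam + gamma)
x = np.zeros(5)
ratios = []
for t in range(50):
    err = np.linalg.norm(x - x_true)
    x = x - delta * 2.0 * A.T @ (A @ x - b)
    ratios.append(np.linalg.norm(x - x_true) / err)

# Every observed contraction factor respects the bound (up to round-off).
print(max(ratios) <= rho_star + 1e-9, rho_star)
```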
I-B. Pre-Conditioning of DGD
It is possible, however, to improve the convergence rate of DGD beyond (4) by proper preconditioning of the gradients in (3), described as follows. Let $K$, referred to as the preconditioner, be a square matrix of size $d \times d$. The server now updates its estimate as follows:

$$x(t+1) = x(t) - \delta K \sum_{i=1}^m g^i(t). \qquad (5)$$

If the matrix product $K A^T A$ is positive definite, then the convergence of (5) can be made linear by choosing $\delta$ appropriately, and the smallest possible convergence rate for (5) is given by [fessler2008image]

$$\rho_K^* = \frac{\lambda_K - \gamma_K}{\lambda_K + \gamma_K}, \qquad (6)$$

where $\lambda_K$ and $\gamma_K$ denote the largest and smallest eigenvalues of $K A^T A$, respectively. However, most of the existing techniques for computing such a matrix $K$ for which $\rho_K^*$ is provably smaller than $\rho^*$, e.g. the incomplete LU factorization algorithms [meijerink1977iterative], accelerated iterative methods [axelsson1985survey], and the symmetric successive over-relaxation method [axelsson1985survey], require the server to have direct access to the matrices $A^i$'s. The parallelizable preconditioning methods [axelsson1985survey], [benzi2002preconditioning] require $A$ to be a symmetric positive definite matrix. Therefore, the existing preconditioning techniques cannot be implemented in the considered distributed framework.
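To see why preconditioning pays off, consider the ideal (centralized, hence unavailable here) choice $K = (A^T A)^{-1}$: then $K A^T A = I$ has condition number 1, and the update (5) with $\delta = 1/2$ recovers $x^*$ in a single step. A hypothetical sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 4))
x_true = rng.standard_normal(4)
b = A @ x_true

K = np.linalg.inv(A.T @ A)   # ideal preconditioner; needs direct access to A
delta = 0.5                  # so that delta * K * (2 A^T A) = I

x = np.zeros(4)
# One preconditioned step of (5):
x = x - delta * K @ (2.0 * A.T @ (A @ x - b))
print(np.linalg.norm(x - x_true))   # ~0: exact in one iteration
```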
I-C. Summary of Contributions
We propose a time-varying preconditioner matrix $K(t)$, instead of a constant preconditioner matrix $K$, for (5). That is, the server updates its estimate as follows:

$$x(t+1) = x(t) - \delta K(t) \sum_{i=1}^m g^i(t). \qquad (7)$$

The preconditioner $K(t)$ can be computed in a distributed manner by the server for each iteration $t$, as presented in Section II. We show that the iterative process (7) converges to the optimum point faster than the original DGD algorithm (3). In our experiments, we have also observed that the convergence speed of the proposed algorithm compares favourably with that of the accelerated projection-based consensus (APC) method proposed in [azizan2019distributed].
II. Proposed Algorithm
In this section, we present our distributed preconditioning technique that provably improves the convergence speed of the distributed gradient descent method (3) for solving the distributed least squares problem (1).
To be able to present our preconditioning technique, we introduce the following notation.

For a positive integer $n$, let $[n]$ denote the set $\{1, \ldots, n\}$.

Let $e_j$ denote the $j$-th column of the $d$-dimensional identity matrix $I$.

Let $K(t)$ be the preconditioner matrix for iteration $t$, and let $k_j(t)$ denote the $j$-th column of $K(t)$.

For $j \in [d]$, define

$$R^i_j(t) = \left( (A^i)^T A^i + \frac{\beta}{m} I \right) k_j(t) - \frac{1}{m} e_j. \qquad (8)$$
The preconditioned DGD method is described below in Algorithm 1. Note that $\alpha$, $\beta$, and $\delta$ are positive-valued parameters in the algorithm. In each iteration $t$, the server updates the preconditioner and the estimate as

$$k_j(t+1) = k_j(t) - \alpha \sum_{i=1}^m R^i_j(t), \quad j \in [d], \qquad (9)$$

$$x(t+1) = x(t) - \delta \, K(t+1) \sum_{i=1}^m g^i(t). \qquad (10)$$
For each iteration $t$, the server maintains a preconditioner matrix $K(t)$ and an estimate $x(t)$ of the point of optimum $x^*$. The initial preconditioner matrix $K(0)$ and estimate $x(0)$ are chosen arbitrarily. The server sends $K(t)$ and $x(t)$ to the machines. Each machine $i$ computes $R^i_j(t)$ for $j \in [d]$ as given in (8) and the local gradient $g^i(t)$ as given in (2), and sends these to the server. Then, the server computes the updated preconditioner $K(t+1)$ as given in (9), and uses this updated preconditioner to compute the updated estimate $x(t+1)$ as given by (10).
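The steps above can be sketched end-to-end in Python. This is an illustrative implementation of the updates (8)–(10) under the reconstruction $\sum_i R^i_j(t) = (A^T A + \beta I) k_j(t) - e_j$; the parameter values $\alpha$, $\beta$, $\delta$ below are arbitrary choices for the synthetic data, not tuned values from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d = 4, 5
A = [rng.standard_normal((8, d)) for _ in range(m)]
x_true = rng.standard_normal(d)
b = [Ai @ x_true for Ai in A]
I = np.eye(d)

beta = 0.1
H = sum(Ai.T @ Ai for Ai in A)            # A^T A (used only to pick alpha)
alpha = 1.0 / (np.linalg.eigvalsh(H).max() + beta)
delta = 0.45                              # illustrative step-size

def machine_message(Ai, bi, K, x):
    """What machine i sends: its preconditioner term (all d columns of
    (8) at once, as a d x d matrix) and its local gradient (2)."""
    R = (Ai.T @ Ai + (beta / m) * I) @ K - (1.0 / m) * I
    g = 2.0 * Ai.T @ (Ai @ x - bi)
    return R, g

x = np.zeros(d)
K = np.zeros((d, d))                      # arbitrary K(0)
for t in range(300):
    msgs = [machine_message(A[i], b[i], K, x) for i in range(m)]
    K = K - alpha * sum(R for R, _ in msgs)        # update (9)
    x = x - delta * K @ sum(g for _, g in msgs)    # update (10)

print(np.linalg.norm(x - x_true))         # small: converges to x*
```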
II-A. Convergence Analysis
To be able to present the convergence of Algorithm 1, we introduce the following notation.

Let

$$K^* = \left( A^T A + \beta I \right)^{-1},$$

where $I$ is the $d$-dimensional identity matrix.

Let $k^*_j$ be the $j$-th column of $K^*$, $j \in [d]$.
We make the following assumption.
Assumption 1: Assume that the matrix $A^T A$ is full rank. Let $\gamma$ and $\lambda$ denote the smallest and largest eigenvalues of $A^T A$, respectively.
Then, $\lambda \ge \gamma > 0$.
Lemma 1

Consider the iterative process (9), and let $\beta > 0$. Under Assumption 1, there exists $\alpha$ for which there is a positive value $\rho < 1$ such that, for each $j \in [d]$ and each iteration $t$,

$$\| k_j(t+1) - k^*_j \| \le \rho \, \| k_j(t) - k^*_j \|.$$

Proof of Lemma 1 is in Appendix A.
The above lemma shows that each column of the preconditioner matrix $K(t)$ asymptotically converges to the corresponding column of $K^*$. In other words, the matrix $K(t)$ asymptotically converges to $K^*$.
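This convergence is easy to observe numerically. Summing the column updates (9) over the machines gives, under the reconstruction used here, the aggregated iteration $K(t+1) = K(t) - \alpha\big((A^T A + \beta I) K(t) - I\big)$, whose fixed point is $K^* = (A^T A + \beta I)^{-1}$. A sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((25, 6))
H = A.T @ A
beta = 0.5
I = np.eye(6)
K_star = np.linalg.inv(H + beta * I)

alpha = 1.0 / (np.linalg.eigvalsh(H).max() + beta)

K = np.zeros((6, 6))
errs = []
for t in range(300):
    K = K - alpha * ((H + beta * I) @ K - I)   # aggregated form of (9)
    errs.append(np.linalg.norm(K - K_star))

# The distance to K* shrinks by a constant factor per iteration.
print(errs[0], errs[10], errs[-1])
```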
Proposition 1

If Assumption 1 holds and $\beta > 0$, then $K^* A^T A$ is positive definite, and the largest and the smallest eigenvalues of $K^* A^T A$ are $\lambda/(\lambda + \beta)$ and $\gamma/(\gamma + \beta)$, respectively.

Proof of Proposition 1 is deferred to Appendix B.
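The eigenvalue claim can be verified numerically: assuming $K^* = (A^T A + \beta I)^{-1}$, the eigenvalues of $K^* A^T A$ are $\mu/(\mu + \beta)$ for each eigenvalue $\mu$ of $A^T A$, so they are all positive and the extreme ones are $\lambda/(\lambda + \beta)$ and $\gamma/(\gamma + \beta)$. A small check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((30, 5))
H = A.T @ A
beta = 1.0

mu = np.linalg.eigvalsh(H)                 # eigenvalues of A^T A, ascending
gamma, lam = mu[0], mu[-1]

K_star = np.linalg.inv(H + beta * np.eye(5))
eigs = np.sort(np.linalg.eigvals(K_star @ H).real)

print(np.allclose(eigs, mu / (mu + beta)))         # True
print(eigs[-1], lam / (lam + beta))                # largest matches
print(eigs[0], gamma / (gamma + beta))             # smallest matches
```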
As shown in the next proposition, the smallest and the largest eigenvalues of $K^* A^T A$ have a direct influence on the convergence speed of Algorithm 1.
Proposition 2

Consider Algorithm 1 with $\beta > 0$. Under Assumption 1, there exists $\delta$ for which there is a positive value $\rho_K < 1$ and a non-negative sequence $\{\epsilon_t\}$ converging to zero such that, for all $t$,

$$\| x(t+1) - x^* \| \le \left( \rho_K + \epsilon_t \right) \| x(t) - x^* \|.$$

Proof of Proposition 2 is deferred to Appendix C.
According to Proposition 2, the error bound for the estimate $x(t)$ evolves with $t$, and after a finite time the rate of convergence is bounded by a constant of value less than 1.
Let the estimates of the optimum point computed by Algorithm 1 and DGD after $t$ iterations be denoted by $x(t)$ and $\tilde{x}(t)$, respectively. Define $z(t) = x(t) - x^*$ and $\tilde{z}(t) = \tilde{x}(t) - x^*$. For $t \ge 0$, we denote the upper bound on the relative error $\|z(t)\| / \|\tilde{z}(t)\|$ by $U(t)$, given by the ratio of the respective upper bounds on $\|z(t)\|$ and $\|\tilde{z}(t)\|$.
In the following theorem, we present a formal comparison between the convergence speed of Algorithm 1 and the original DGD method (3).
Theorem 1

Consider Algorithm 1 and the DGD algorithm (3) with identical initial estimates $x(0) = \tilde{x}(0)$. If Assumption 1 holds, then there exists a finite $T$ such that $U(t) < 1$ for all $t \ge T$.

Proof of Theorem 1 is deferred to Appendix D.
Theorem 1 implies that if the server executes both Algorithm 1 and the original DGD algorithm using the same initial estimate of $x^*$, then after a certain number of iterations the upper bound on the distance of the estimate from $x^*$ is smaller for Algorithm 1 than for the original DGD.
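This behaviour can be observed on synthetic data: starting both methods from the same $x(0)$, the error of the iteratively preconditioned update drops below the DGD error after finitely many iterations. The sketch below uses the same aggregated reconstruction of (9)–(10) as before, with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((40, 6))
x_true = rng.standard_normal(6)
b = A @ x_true
H = A.T @ A
I = np.eye(6)

mu = np.linalg.eigvalsh(H)
gamma, lam = mu[0], mu[-1]
beta = 0.1
alpha = 1.0 / (lam + beta)
delta_dgd = 1.0 / (lam + gamma)   # optimal constant step for DGD
delta_alg = 0.45                  # illustrative step for Algorithm 1

x_dgd = np.zeros(6)
x_alg = np.zeros(6)
K = np.zeros((6, 6))
err_dgd, err_alg = [], []
for t in range(30):
    # DGD update (3):
    x_dgd = x_dgd - delta_dgd * 2.0 * A.T @ (A @ x_dgd - b)
    # Algorithm 1 updates (9)-(10), aggregated over machines:
    K = K - alpha * ((H + beta * I) @ K - I)
    x_alg = x_alg - delta_alg * K @ (2.0 * A.T @ (A @ x_alg - b))
    err_dgd.append(np.linalg.norm(x_dgd - x_true))
    err_alg.append(np.linalg.norm(x_alg - x_true))

print(err_alg[-1] < err_dgd[-1])   # expected True for these iteration counts
```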
III. Experiments
In this section, we present experimental results to verify the obtained theoretical convergence guarantees of Algorithm 1. The matrix $A$ is loaded from one of the real datasets in the SuiteSparse Matrix Collection (https://sparse.tamu.edu).
Example. The matrix $A$ is a real symmetric positive definite matrix from the benchmark dataset "bcsstm07", which is part of a structural problem. The dimension of $A$ is $420 \times 420$. We generate the vector $b$ by setting $b = A \mathbf{1}$, where $\mathbf{1}$ is the $420$-dimensional vector of ones. Since $A$ is positive definite, (1) has a unique solution $x^* = \mathbf{1}$. We split the data row-wise among the $m$ machines, so that each machine holds an equal share of the rows of $A$ and $b$.
We apply Algorithm 1 to solve the aforementioned optimization problem. It can be seen that the speed of the algorithm depends on the choice of the parameters $\alpha$, $\beta$, and $\delta$ (ref. Fig. 1(a)). Also, the algorithm converges to $x^*$ irrespective of the initial choice of the entries in $K(0)$ and $x(0)$ (ref. Fig. 1(b)).
The experimental results have been compared with the conventional distributed gradient descent (DGD) method [yuan2016convergence] and the accelerated projection-based consensus (APC) method [azizan2019distributed] (ref. Figs. 3 and 4). We could not find any existing preconditioning technique for the distributed least-squares problem (1) in the literature to compare with.
We compare the time needed by these three algorithms to reach a given relative estimation error, defined as $\| x(t) - x^* \| / \| x^* \|$. The tuning parameter $\beta$ of Algorithm 1 has been set accordingly, and the rest of the algorithm parameters have been set such that the respective algorithms have their smallest possible convergence rates: the parameters $\alpha$ and $\delta$ for Algorithm 1, the tuning parameters for APC, and the step-size $\delta$ for DGD. We found that Algorithm 1 and APC reach the target relative estimation error in comparable time, while DGD attains only a larger relative error in the same time. The recorded times are averaged over five runs of the corresponding algorithms.
Note that evaluating the optimal tuning parameters for any of these algorithms requires knowledge of the smallest and largest eigenvalues of $A^T A$. Clearly, Algorithm 1 performs much faster than DGD and has a speed of convergence similar to that of APC. However, we have formally shown Algorithm 1 to be faster than the DGD algorithm, unlike the APC algorithm, which has only been speculated to perform faster.
We notice that the error norm for Algorithm 1 is less than that of DGD from an approximate iteration index onward (ref. Fig. 2(b)). Since $x(0)$ is the same for both algorithms, this observation is in agreement with the claim in Theorem 1. In the case of the optimal algorithm parameters, APC is faster than Algorithm 1 up to a certain time, but after that the error of Algorithm 1 decreases faster (ref. Fig. 3(b)).
IV. Discussions
In this paper, we proposed an algorithm that distributively solves linear least-squares minimization problems over a server-machine network architecture. The novelty of the proposed method lies in incorporating a time-varying preconditioning technique into the classical distributed gradient descent method in a manner that has been theoretically shown to boost convergence speed. The computation of the preconditioner is done at the server without requiring any direct access to the data. In practice, we tested the algorithm on a real dataset and obtained much better performance than the classical distributed gradient algorithm in terms of the time needed for convergence to the true solution. Although an existing accelerated algorithm performs similarly to the proposed method on this problem in terms of convergence speed, its speed-up is not guaranteed theoretically. We, however, have formally shown that the proposed algorithm converges faster than the distributed gradient method.
References
Appendix
A. Proof of Lemma 1
B. Proof of Proposition 1
Consider the eigendecomposition $A^T A = V \Lambda V^T$, where $V$ is the eigenvector matrix of $A^T A$ and $\Lambda$ is a diagonal matrix with the eigenvalues of $A^T A$ on its diagonal. Since $K^* = (A^T A + \beta I)^{-1} = V (\Lambda + \beta I)^{-1} V^T$ and the identity matrix can be written as $I = V V^T$, we have

$$K^* A^T A = V (\Lambda + \beta I)^{-1} V^T V \Lambda V^T = V (\Lambda + \beta I)^{-1} \Lambda V^T,$$

where the diagonal entries of $(\Lambda + \beta I)^{-1} \Lambda$ are $\mu_j/(\mu_j + \beta)$, with $\mu_1, \ldots, \mu_d$ the eigenvalues of the positive definite matrix $A^T A$. Since $\mu_j > 0$ for each $j \in [d]$, the claim follows.
C. Proof of Proposition 2
Using (2) and observing that $A^T b = A^T A x^*$, the dynamics (10) can be rewritten as

$$x(t+1) = x(t) - 2 \delta \, K(t+1) A^T A \left( x(t) - x^* \right). \qquad (13)$$

Define $z(t) = x(t) - x^*$. From (13),

$$z(t+1) = \left( I - 2 \delta K^* A^T A \right) z(t) - 2 \delta \left( K(t+1) - K^* \right) A^T A \, z(t), \qquad (14)$$

where $K(t+1) - K^*$ comprises the columns $k_j(t+1) - k^*_j$, $j \in [d]$. From Lemma 1, it follows that

$$\| k_j(t+1) - k^*_j \| \le \rho^{t+1} \| k_j(0) - k^*_j \|, \quad j \in [d]. \qquad (15)$$

Now,

$$\| K(t+1) - K^* \| \le \sqrt{\sum_{j=1}^d \| k_j(t+1) - k^*_j \|^2} \le \rho^{t+1} \sqrt{\sum_{j=1}^d \| k_j(0) - k^*_j \|^2}, \qquad (16)$$

where the last inequality follows from (15). From (14) and (16),

$$\| z(t+1) \| \le \left( \left\| I - 2 \delta K^* A^T A \right\| + 2 \delta \lambda \, \rho^{t+1} \sqrt{\sum_{j=1}^d \| k_j(0) - k^*_j \|^2} \right) \| z(t) \|. \qquad (17)$$

Since $K^* A^T A$ is positive definite for $\beta > 0$ (from Proposition 1), there exists $\delta > 0$ for which $\| I - 2 \delta K^* A^T A \| < 1$ and (17) holds. The smallest value of $\| I - 2 \delta K^* A^T A \|$ is $\rho_K = (\lambda_K - \gamma_K)/(\lambda_K + \gamma_K)$, where $\lambda_K$ and $\gamma_K$ denote the largest and smallest eigenvalues of $K^* A^T A$, respectively.

Pick any $\rho' \in (\rho_K, 1)$. As $\rho^{t+1} \to 0$, there exists a finite $T$ such that $\rho_K + 2 \delta \lambda \, \rho^{t+1} \sqrt{\sum_{j=1}^d \| k_j(0) - k^*_j \|^2} \le \rho'$ for all $t \ge T$. Then

$$\| z(t+1) \| \le \rho' \, \| z(t) \|, \quad \forall t \ge T. \qquad (18)$$

Then, with the largest and the smallest eigenvalues of $K^* A^T A$ obtained in Proposition 1, the claim follows.
D. Proof of Theorem 1
From Proposition 2, for any $\rho' \in (\rho_K, 1)$ there exists a finite $T'$ such that $\| z(t+1) \| \le \rho' \| z(t) \|$ for all $t \ge T'$. Since this holds for any such $\rho'$, we can say that there exists a finite constant $c$ such that, for all $t \ge T'$,

$$\| z(t) \| \le c \, (\rho')^t, \qquad (19)$$

where $c$ depends only on $T'$ and is independent of $t$. The inequality on the right-hand side follows from the first statement in Proposition 2. For DGD, we have, for all $t$,

$$\| \tilde{z}(t) \| \le \tilde{c} \, (\rho^*)^t, \qquad (20)$$

where $\tilde{c} = \| \tilde{z}(0) \|$. It can be easily checked that $\rho_K < \rho^*$ for $\beta > 0$, so $\rho'$ can be chosen such that $\rho_K < \rho' < \rho^*$. From (19) and (20), we have, for all $t \ge T'$,

$$U(t) = \frac{c}{\tilde{c}} \left( \frac{\rho'}{\rho^*} \right)^t,$$

where $c$ and $\tilde{c}$ do not depend on $t$ and $\rho'/\rho^* < 1$. Hence, there exists a finite $T \ge T'$ such that $U(t) < 1$ for all $t \ge T$. This completes the proof.