We consider a synchronous system comprising a server and $m$ machines, as shown in Fig. 1. Each machine $i$ holds a pair $(A^i, b^i)$, where $A^i$ is a matrix of size $n_i \times n$ and $b^i$ is a column vector of size $n_i$. Let $\mathbb{R}^n$ denote the set of real-valued vectors of dimension $n$. For a vector $v \in \mathbb{R}^n$, let $\|v\|$ denote its 2-norm. If $v^T$ denotes the transpose of $v$, then $\|v\|^2 = v^T v$.
The objective of the server is to solve the least-squares problem:

$x^* \in \arg\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \|A^i x - b^i\|^2. \quad (1)$
As the server does not know the values of the $A^i$'s and $b^i$'s, it must collaborate with the machines to solve the problem.
The common approaches for solving such distributed optimization problems include distributed gradient descent (DGD) [nedic2009distributed, yuan2016convergence], distributed alternating direction method of multipliers (DADMM) [shi2014linear, zhang2014asynchronous] and distributed dual averaging method (DDA) [duchi2011dual, tsianos2012push].
Accelerated gradient methods, such as [azizan2019distributed, nesterov27method], can also be used to solve (1).
In this paper, we consider the distributed gradient descent (DGD) method, and propose a distributed iterative pre-conditioning technique for improving its convergence speed.
The DGD method is an iterative algorithm in which the server maintains an estimate of a minimum point of (1), denoted by $x^*$, and updates it iteratively by collaborating with the machines as follows. For each iteration $t = 0, 1, \ldots$, let $x(t)$ denote the estimate of $x^*$ at the beginning of iteration $t$. The initial value $x(0)$ is chosen arbitrarily. In each iteration $t$, the server broadcasts $x(t)$ to all the machines. Each machine $i$ computes the gradient of the function $f^i(x) = \|A^i x - b^i\|^2$ at $x(t)$, denoted by $\nabla f^i(x(t))$, and sends it to the server. Note that

$\nabla f^i(x(t)) = 2 (A^i)^T \left(A^i x(t) - b^i\right). \quad (2)$
Upon receiving the gradients $\nabla f^1(x(t)), \ldots, \nabla f^m(x(t))$ from all the machines, the server updates $x(t)$ to $x(t+1)$ using a constant step-size $\delta > 0$ as follows:

$x(t+1) = x(t) - \delta \sum_{i=1}^m \nabla f^i(x(t)). \quad (3)$
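Concretely, the DGD loop described above can be sketched as follows in Python; the data, machine count, and step-size are hypothetical stand-ins, with the step-size chosen from the largest eigenvalue of $A^T A$ so that the iteration is stable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: m machines, each holding a pair (A_i, b_i).
m, n = 3, 5
A_parts = [rng.standard_normal((10, n)) for _ in range(m)]
x_true = rng.standard_normal(n)
b_parts = [A_i @ x_true for A_i in A_parts]  # consistent system, so x* = x_true

def local_gradient(A_i, b_i, x):
    # Gradient of f_i(x) = ||A_i x - b_i||^2.
    return 2.0 * A_i.T @ (A_i @ x - b_i)

# Server-side DGD loop with a constant step-size.
A = np.vstack(A_parts)
delta = 1.0 / (2.0 * np.linalg.eigvalsh(A.T @ A).max())  # safe constant step-size
x = np.zeros(n)
for _ in range(2000):
    grad_sum = sum(local_gradient(A_i, b_i, x)
                   for A_i, b_i in zip(A_parts, b_parts))
    x = x - delta * grad_sum
```

Each machine only ever evaluates its own gradient; the server sees the sum, never the individual $A^i$'s.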
To be able to present the contribution of this paper, we first briefly review the convergence of the DGD method described above.
I-A Convergence of DGD
Let $A$ denote the matrix obtained by stacking the matrices $A^i$ vertically. So, matrix $A$ is of size $\left(\sum_{i=1}^m n_i\right) \times n$. If matrix $A$ is full column rank, then we know that there exists a step-size $\delta$ for which there is a positive value $\rho < 1$ such that, for all $t$ [fessler2008image],

$\|x(t+1) - x^*\| \leq \rho \, \|x(t) - x^*\|.$
The value $\rho$ is commonly referred to as the convergence rate. Smaller $\rho$ implies higher convergence speed, and vice versa. If we let $\lambda$ and $\gamma$ denote the largest and smallest eigenvalues of $A^T A$, then it is known that [fessler2008image]

$\rho \geq \frac{\lambda - \gamma}{\lambda + \gamma}. \quad (4)$
The ratio $\lambda / \gamma$ is also commonly referred to as the condition number of the matrix $A^T A$, which we denote by $\kappa$ [fessler2008image].
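As an illustration, both the rate bound and the condition number can be computed directly from the spectrum of $A^T A$; the matrix below is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 5))   # hypothetical stacked matrix A

eigs = np.linalg.eigvalsh(A.T @ A)  # eigenvalues in ascending order
gamma, lam = eigs[0], eigs[-1]      # smallest and largest eigenvalues

rho_bound = (lam - gamma) / (lam + gamma)  # lower bound on the rate, as in (4)
kappa = lam / gamma                        # condition number of A^T A
# Equivalently, rho_bound = (kappa - 1) / (kappa + 1): ill-conditioning
# (large kappa) pushes the achievable rate toward 1, i.e., slow convergence.
```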
I-B Pre-Conditioning of DGD
It is possible, however, to improve the convergence rate of DGD beyond (4) by proper pre-conditioning of the gradients in (3), described as follows. Let $K$, referred to as the pre-conditioner, be a square matrix of size $n \times n$. The server now updates its estimate as follows:

$x(t+1) = x(t) - \delta K \sum_{i=1}^m \nabla f^i(x(t)). \quad (5)$
However, most of the existing techniques for computing a pre-conditioner $K$ such that the resulting convergence rate is provably smaller than (4), e.g., the incomplete LU factorization algorithms [meijerink1977iterative], accelerated iterative methods [axelsson1985survey], and the symmetric successive over-relaxation method [axelsson1985survey], require the server to have direct access to the matrices $A^i$. The parallelizable pre-conditioning methods [axelsson1985survey], [benzi2002preconditioning] require the matrix being pre-conditioned to be symmetric positive definite. Therefore, the existing pre-conditioning techniques cannot be implemented in the distributed framework considered here.
I-C Summary of Contributions
We propose a time-varying pre-conditioner matrix $K(t)$, instead of a constant pre-conditioner matrix $K$, in (5). That is, the server updates its estimate as follows:

$x(t+1) = x(t) - \delta K(t) \sum_{i=1}^m \nabla f^i(x(t)). \quad (7)$
The pre-conditioner $K(t)$ can be computed in a distributed manner by the server in each iteration $t$, as presented in Section II. We show that the iterative process (7) converges faster to the optimum point $x^*$ than the original DGD algorithm (3). In our experiments, we have also observed that the convergence speed of the proposed algorithm compares favourably with that of the accelerated projection-based consensus (APC) method proposed in [azizan2019distributed].
II Proposed Algorithm
In this section, we present our distributed pre-conditioning technique that provably improves the convergence speed of the distributed gradient descent method (3) for solving the distributed least squares problem (1).
To be able to present our pre-conditioning technique, we introduce the following notation.
For a positive integer $n$, let $[n]$ denote the set $\{1, \ldots, n\}$.
Let $e_j$ denote the $j$-th column of the $n$-dimensional identity matrix $I$.
Let $K(t)$ be the pre-conditioner matrix for iteration $t$, and let $k_j(t)$ denote the $j$-th column of $K(t)$.
For each $j \in [n]$ and each machine $i$, define

$R^i_j(t) = \left((A^i)^T A^i + \frac{\beta}{m} I\right) k_j(t) - \frac{1}{m} e_j. \quad (8)$
The pre-conditioned DGD method is described below in Algorithm 1. Note that $\alpha$, $\beta$, and $\delta$ are positive-valued parameters of the algorithm.
For each iteration $t$, the server maintains a pre-conditioner matrix $K(t)$ and an estimate $x(t)$ of the point of optimum $x^*$. The initial pre-conditioner matrix $K(0)$ and estimate $x(0)$ are chosen arbitrarily. The server sends $x(t)$ and $K(t)$ to the machines. Each machine $i$ computes $R^i_j(t)$ for $j \in [n]$ as given in (8), and the local gradient $\nabla f^i(x(t))$ as given in (2), and sends these to the server. Then, the server computes the updated pre-conditioner $K(t+1)$ as given in (9), and uses this updated pre-conditioner to compute the updated estimate $x(t+1)$ as given by (10).
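A minimal sketch of one consistent instantiation of this exchange is below. It assumes each machine $i$ sends a matrix $R^i(t) = ((A^i)^T A^i + (\beta/m)I)K(t) - (1/m)I$ (stacking the per-column terms) along with its local gradient, that the server sets $K(t+1) = K(t) - \alpha \sum_i R^i(t)$ and then takes the pre-conditioned step; the data and all parameter values are assumptions of this sketch, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: m machines, each holding (A_i, b_i).
m, n = 3, 5
A_parts = [rng.standard_normal((10, n)) for _ in range(m)]
x_true = rng.standard_normal(n)
b_parts = [A_i @ x_true for A_i in A_parts]
A = np.vstack(A_parts)

beta = 0.1                                  # pre-conditioner regularization
lam = np.linalg.eigvalsh(A.T @ A).max()
alpha = 1.0 / (lam + beta)                  # step-size for the K(t) iteration
delta = 0.5                                 # step-size for the estimate update

def machine_msgs(A_i, b_i, x, K):
    # Machine-side computation (one consistent instantiation):
    # R_i stacks the per-column terms; g_i is the local gradient (2).
    R_i = (A_i.T @ A_i + (beta / m) * np.eye(n)) @ K - (1.0 / m) * np.eye(n)
    g_i = 2.0 * A_i.T @ (A_i @ x - b_i)
    return R_i, g_i

K = np.zeros((n, n))
x = np.zeros(n)
for _ in range(500):
    msgs = [machine_msgs(A_i, b_i, x, K) for A_i, b_i in zip(A_parts, b_parts)]
    K = K - alpha * sum(R_i for R_i, _ in msgs)      # pre-conditioner update
    x = x - delta * K @ sum(g_i for _, g_i in msgs)  # pre-conditioned DGD step
```

Note that each machine's message depends only on its own data, so the server still never accesses any $A^i$ directly.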
II-A Convergence Analysis
To be able to present the convergence of Algorithm 1, we introduce the following notation.
Let $K^* = (A^T A + \beta I)^{-1}$, where $I$ is the $n$-dimensional identity matrix.
Let $k^*_j$ be the $j$-th column of $K^*$, for $j \in [n]$.
We make the following assumption.
Assumption 1: Assume that the matrix $A^T A$ is full rank. Let $\gamma$ and $\lambda$ denote the smallest and largest eigenvalues of $A^T A$, respectively.
Lemma 1. Consider the iterative process (9), and let $\beta > 0$. Under Assumption 1, there exists a step-size $\alpha$ for which there is a positive value $\rho_1 < 1$ such that for each $j \in [n]$ and each iteration $t$,

$\|k_j(t+1) - k^*_j\| \leq \rho_1 \, \|k_j(t) - k^*_j\|.$
Proof of Lemma 1 is in Appendix A.
The above lemma shows that each column of the pre-conditioner matrix $K(t)$ asymptotically converges to the corresponding column of $K^*$. In other words, the matrix $K(t)$ asymptotically converges to $K^* = (A^T A + \beta I)^{-1}$.
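This limit is easy to check numerically. The sketch below assumes that summing the machines' terms in (9) yields the centralized iteration $K(t+1) = K(t) - \alpha\left((A^T A + \beta I)K(t) - I\right)$, i.e., a Richardson iteration for each column of the inverse, and compares the result against $(A^T A + \beta I)^{-1}$; it also checks the eigenvalue claim of Proposition 1 below.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((30, 5))   # hypothetical stacked matrix A
n, beta = A.shape[1], 0.1

M = A.T @ A + beta * np.eye(n)
K_star = np.linalg.inv(M)

# Assumed centralized form of (9): a Richardson iteration per column,
# solving (A^T A + beta I) k_j = e_j for every j simultaneously.
alpha = 1.0 / np.linalg.eigvalsh(M).max()
K = np.zeros((n, n))
for _ in range(5000):
    K = K - alpha * (M @ K - np.eye(n))

err = np.linalg.norm(K - K_star)   # should be near zero

# Proposition 1: the eigenvalues of K* are 1 / (eig(A^T A) + beta).
gamma, lam = np.linalg.eigvalsh(A.T @ A)[[0, -1]]
eigs_K = np.linalg.eigvalsh(K_star)
```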
Proposition 1. If Assumption 1 holds and $\beta > 0$, then $K^*$ is positive definite, and the largest and the smallest eigenvalues of $K^*$ are $1/(\gamma + \beta)$ and $1/(\lambda + \beta)$, respectively.
Proof of Proposition 1 is deferred to Appendix B.
As shown in the next proposition, the smallest and the largest eigenvalues of $K^*$ have a direct influence on the convergence speed of Algorithm 1.
Proposition 2. Consider Algorithm 1 with $\beta > 0$. Under Assumption 1, there exists a step-size $\delta$ for which there is a positive value $\rho < 1$ such that, for all $t$,

$\|x(t+1) - x^*\| \leq \left(\rho + \epsilon(t)\right) \|x(t) - x^*\|,$

where $\epsilon(t) \to 0$ as $t \to \infty$.
Proof of Proposition 2 is deferred to Appendix C.
According to Proposition 2, the error bound for the estimate $x(t)$ evolves with $t$, and after a finite number of iterations the rate of convergence is bounded above by a constant of value less than 1.
Let the estimates of the optimum point computed by Algorithm 1 and DGD after $t$ iterations be denoted by $x(t)$ and $x_{DGD}(t)$, respectively. Define the corresponding errors $\|x(t) - x^*\|$ and $\|x_{DGD}(t) - x^*\|$. For each iteration $t$, we consider the ratio of the upper bound on the error of Algorithm 1 to that of DGD, which we refer to as the upper bound on the relative error.
Theorem 1. Consider Algorithm 1 and the DGD algorithm (3) with identical initial estimate $x(0)$. If Assumption 1 holds, then there exists a finite $T$ such that the upper bound on the relative error is less than 1 for all $t \geq T$.
Proof of Theorem 1 is deferred to Appendix D.
Theorem 1 implies that if the server executes both Algorithm 1 and the original DGD algorithm from the same initial estimate of $x^*$, then after a certain number of iterations the upper bound on the distance to the optimum for the estimates generated by Algorithm 1 is smaller than that for the original DGD.
III Experiments

In this section, we present experimental results to verify the obtained theoretical convergence guarantees of Algorithm 1. The matrix $A$ is loaded from one of the real data-sets in the SuiteSparse Matrix Collection (https://sparse.tamu.edu).
Example 1. The matrix $A$ is a real symmetric positive definite matrix from the benchmark dataset "bcsstm07", which is part of a structural problem. We generate the vector $b$ by setting $b = A x^*$, where $x^*$ is the $n$-dimensional vector of ones. Since $A$ is positive definite, (1) has a unique solution $x^*$. We split the rows of the pair $(A, b)$ among the $m$ machines, so that machine $i$ holds $(A^i, b^i)$.
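The data preparation in this example can be sketched as follows; since loading "bcsstm07" from the collection is environment-specific, a synthetic symmetric positive definite stand-in of hypothetical size is used here.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for the SPD matrix "bcsstm07" (loading from the
# SuiteSparse collection is not shown); sizes here are hypothetical.
n, m = 20, 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)     # symmetric positive definite by construction

x_star = np.ones(n)             # the n-dimensional vector of ones
b = A @ x_star                  # b = A x*, so (1) has the unique solution x*

# Split the rows of (A, b) evenly among the m machines.
A_parts = np.split(A, m)        # each A_i has shape (n/m, n)
b_parts = np.split(b, m)
```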
We apply Algorithm 1 to solve the aforementioned optimization problem. As can be seen, the speed of the algorithm depends on the choice of the parameters $\alpha$, $\beta$, and $\delta$ (ref. Fig. 1(a)). Also, the algorithm converges to $x^*$ irrespective of the initial choice of the entries in $K(0)$ and $x(0)$ (ref. Fig. 1(b)).
The experimental results have been compared with the conventional distributed gradient descent (DGD) [yuan2016convergence] method and the accelerated projection-based consensus (APC) [azizan2019distributed] method (ref. Fig. 3 & 4). We could not find any existing preconditioning technique for the distributed least squares problem (1) in the literature to compare with.
We compare the time needed by these three algorithms to reach a small relative estimation error, defined as $\|x(t) - x^*\| / \|x^*\|$. The tuning parameter $\beta$ of Algorithm 1 has been set accordingly, and the rest of the algorithm parameters have been set such that the respective algorithms attain their smallest possible convergence rates. We found that Algorithm 1 and APC need comparable time to reach the target relative estimation error, while DGD, in a similar amount of time, attains only a larger relative error. The recorded times are averages over five runs of the corresponding algorithms.
Note that evaluating the optimal tuning parameters for any of these algorithms requires knowledge of the smallest and largest eigenvalues of $A^T A$. Clearly, Algorithm 1 performs much faster than DGD and has a speed of convergence similar to APC. However, we have formally shown Algorithm 1 to be faster than the DGD algorithm, unlike the APC algorithm, which has only been speculated to perform faster.
We notice that the error norm for Algorithm 1 is less than that of DGD from a certain iteration index onward (ref. Fig. 2(b)). Since the initial estimate $x(0)$ is the same for both algorithms, this observation is in agreement with the claim in Theorem 1. In the case of the optimal algorithm parameters, APC is faster than Algorithm 1 up to a certain time, but after that the error of Algorithm 1 decreases faster (ref. Fig. 3(b)).
IV Summary

In this paper, we proposed an algorithm that distributively solves linear least-squares minimization problems over a server-machine network architecture. The novelty of the proposed method lies in incorporating a time-varying pre-conditioning technique into the classical distributed gradient descent method in a way that has been theoretically shown to boost convergence speed. The computation of the pre-conditioner is done at the server without requiring any direct access to the machines' data. In practice, we tested the algorithm on a real data-set and obtained much better performance than the classical distributed gradient algorithm in terms of the time needed for convergence to the true solution. Although an existing accelerated algorithm performs similarly to the proposed method on this problem in terms of convergence speed, its performance is not guaranteed theoretically. We, however, have formally shown the proposed algorithm to converge faster than the distributed gradient method.
IV-A Proof of Lemma 1
IV-B Proof of Proposition 1
Consider the eigendecomposition $A^T A = V \Lambda V^T$, where $V$ is the eigenvector matrix of $A^T A$ and $\Lambda$ is a diagonal matrix with the eigenvalues of $A^T A$ on the diagonal. Since $K^* = (A^T A + \beta I)^{-1}$ and the identity matrix can be written as $I = V V^T$, we have

$K^* = \left(V (\Lambda + \beta I) V^T\right)^{-1} = V (\Lambda + \beta I)^{-1} V^T,$

where the diagonal entries of $(\Lambda + \beta I)^{-1}$ are $1/(\lambda_i + \beta)$, with $\lambda_i$ the eigenvalues of the positive definite matrix $A^T A$. Since $\gamma \leq \lambda_i \leq \lambda$, the claim follows.
IV-C Proof of Proposition 2
Define $\tilde{K}(t) = K(t) - K^*$, so that $\tilde{K}(t)$ comprises the columns $k_j(t) - k^*_j$, $j \in [n]$. From (13) and Lemma 1, it follows that $\|\tilde{K}(t)\| \to 0$ as $t \to \infty$.
Pick any $\epsilon > 0$. As $\|\tilde{K}(t)\| \to 0$, there exists a finite $T$ such that $\|\tilde{K}(t)\| < \epsilon$ for all $t \geq T$. Then, with the largest and the smallest eigenvalues of $K^*$ obtained in Proposition 1, the claim follows.
IV-D Proof of Theorem 1
From Proposition 2, the contraction bound holds for any $t$ such that $t \geq T$. Since this holds for any such $t$, it follows that for all $t \geq T$ the error of Algorithm 1 admits an upper bound whose leading constant depends only on the initial estimate $x(0)$ and is independent of $t$; the inequality on the right-hand side follows from the first statement in Proposition 2. For DGD, we have, for all $t$, a corresponding bound with the rate given by (4). Since these constants do not depend on $t$, there exists a finite $T$ such that the upper bound on the relative error is less than 1 for all $t \geq T$. This completes the proof.