Iterative Pre-Conditioning to Expedite the Gradient-Descent Method

03/13/2020 ∙ by Kushal Chakrabarti, et al. ∙ University of Maryland ∙ Georgetown University

The gradient-descent method is one of the most widely used and perhaps the most natural methods for solving an unconstrained minimization problem. The method is simple and can be implemented easily in distributed settings, which is the focus of this paper. We consider a distributed system of multiple agents where each agent has a local cost function, and the goal for the agents is to minimize the sum of their local cost functions. In principle, the distributed minimization problem can be solved by the agents using the traditional gradient-descent method. However, the convergence rate (or speed) of the gradient-descent method is bounded by the condition number of the minimization problem. Consequently, when the minimization problem to be solved is ill-conditioned, the gradient-descent method may require a large number of iterations to converge to the solution. Indeed, in many practical situations, the minimization problem that needs to be solved is ill-conditioned. In this paper, we propose an iterative pre-conditioning method that significantly reduces the impact of the conditioning of the minimization problem on the convergence rate of the traditional gradient-descent algorithm. The proposed pre-conditioning method can be implemented with ease in the considered distributed setting. For now, we only consider a special case of the distributed minimization problem where the local cost functions of the agents are linear squared-errors. Besides the theoretical guarantees, the improved convergence due to our pre-conditioning method is also demonstrated through experiments on a real data-set.


I Introduction

We consider a synchronous system that comprises a server and $m$ machines, as shown in Fig. 1. Each machine $i$ holds a pair $(A^i, b^i)$, where $A^i$ is a matrix of size $n_i \times n$ and $b^i$ is a column vector of size $n_i$. Let $\mathbb{R}^n$ denote the set of real-valued vectors of size $n$. For a vector $v \in \mathbb{R}^n$, let $\|v\|$ denote its 2-norm. If $(\cdot)^T$ denotes the transpose, then $\|v\|^2 = v^T v$.

Fig. 1: Schematic diagram of the server-machine architecture with $m$ machines (a server connected to Machine 1, Machine 2, ..., Machine $m$).

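The objective of the server is to solve the least-squares problem:

$$x^* \in \arg\min_{x \in \mathbb{R}^n} \sum_{i=1}^{m} \frac{1}{2} \left\| A^i x - b^i \right\|^2. \tag{1}$$

As the server does not know the values of the $A^i$'s and $b^i$'s, it must collaborate with the machines to solve the problem.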

The common approaches for solving such distributed optimization problems include distributed gradient descent (DGD) [nedic2009distributed, yuan2016convergence], the distributed alternating direction method of multipliers (DADMM) [shi2014linear, zhang2014asynchronous] and the distributed dual averaging method (DDA) [duchi2011dual, tsianos2012push]. Accelerated gradient methods, such as [azizan2019distributed, nesterov27method], can also solve (1). In this paper, we consider the distributed gradient descent (DGD) method, and propose a distributed iterative pre-conditioning technique for improving its convergence speed.

The DGD method is an iterative algorithm in which the server maintains an estimate of a point of minimum of (1), denoted by $x^*$, and updates it iteratively by collaborating with the machines as follows. For each iteration $t \in \{0, 1, \ldots\}$, let $x(t)$ denote the estimate of $x^*$ at the beginning of iteration $t$. The initial value $x(0)$ is chosen arbitrarily. In each iteration $t$, the server broadcasts $x(t)$ to all the machines. Each machine $i$ computes the gradient of the function $\frac{1}{2}\|A^i x - b^i\|^2$ at $x = x(t)$, denoted by $g^i(t)$, and sends it to the server. Note that

$$g^i(t) = (A^i)^T \left( A^i x(t) - b^i \right). \tag{2}$$

Upon receiving the gradients from all the machines, the server updates $x(t)$ to $x(t+1)$ using a step-size of constant value $\delta$ as follows:

$$x(t+1) = x(t) - \delta \sum_{i=1}^{m} g^i(t). \tag{3}$$
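The DGD iteration described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, assuming each machine's local cost is $\frac{1}{2}\|A^i x - b^i\|^2$; the function names (`matvec`, `local_gradient`, `dgd`), the toy data and the step-size are hypothetical choices made for this example and do not come from the paper.

```python
# Illustrative sketch of distributed gradient descent (DGD) for least squares.
# Assumption: machine i holds (A_i, b_i) with local cost 0.5 * ||A_i x - b_i||^2.

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def local_gradient(A_i, b_i, x):
    """Gradient of 0.5 * ||A_i x - b_i||^2, that is A_i^T (A_i x - b_i)."""
    residual = [p - q for p, q in zip(matvec(A_i, x), b_i)]
    return [sum(A_i[k][j] * residual[k] for k in range(len(A_i)))
            for j in range(len(x))]

def dgd(machines, x0, step, iterations):
    """Run the server loop: broadcast x, sum the local gradients, update x."""
    x = list(x0)
    for _ in range(iterations):
        grads = [local_gradient(A_i, b_i, x) for A_i, b_i in machines]
        total = [sum(g[j] for g in grads) for j in range(len(x))]
        x = [xj - step * gj for xj, gj in zip(x, total)]
    return x

# Toy data (hypothetical): two machines, one row each; the minimizer is (1, 3).
machines = [
    ([[2.0, 0.0]], [2.0]),
    ([[0.0, 1.0]], [3.0]),
]
x_hat = dgd(machines, x0=[0.0, 0.0], step=0.4, iterations=200)
```

In this toy problem the stacked matrix has $A^T A = \mathrm{diag}(4, 1)$, so the step-size 0.4 equals 2 divided by the sum of the extreme eigenvalues, which is the usual choice for the fastest guaranteed contraction of gradient descent on a quadratic.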

To be able to present the contribution of this paper, we first briefly review the convergence of the DGD method described above.

I-A Convergence of DGD

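Let $A = [(A^1)^T, \ldots, (A^m)^T]^T$ denote the matrix obtained by stacking the matrices $A^i$ vertically, so that $A$ is of size $(\sum_{i=1}^{m} n_i) \times n$. If $A$ has full column rank, then there exists a step-size $\delta$ for which there is a positive value $\rho < 1$ such that [fessler2008image]

$$\| x(t+1) - x^* \| \le \rho \, \| x(t) - x^* \|, \quad t = 0, 1, \ldots$$

The value $\rho$ is commonly referred to as the convergence rate. A smaller $\rho$ implies a higher convergence speed, and vice versa. If we let $\lambda$ and $\gamma$ denote the largest and smallest eigenvalues of $A^T A$, then it is known that [fessler2008image]

$$\rho \ge \rho_{GD} = \frac{(\lambda/\gamma) - 1}{(\lambda/\gamma) + 1}. \tag{4}$$

The ratio $\lambda/\gamma$ is commonly referred to as the condition number of the matrix $A^T A$, which we denote by $\kappa(A^T A)$ [fessler2008image].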
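To get a feel for how strongly the condition number limits the speed, the following sketch evaluates the standard best-case contraction factor $(\kappa - 1)/(\kappa + 1)$ of gradient descent on a quadratic and the number of iterations needed to shrink the error bound by a given factor. The function names and the sample eigenvalues are hypothetical and serve only as an illustration.

```python
import math

def best_gd_rate(lam_max, lam_min):
    """Best-case contraction factor (kappa - 1) / (kappa + 1), kappa = lam_max / lam_min."""
    kappa = lam_max / lam_min
    return (kappa - 1.0) / (kappa + 1.0)

def iterations_to_tolerance(rate, tol):
    """Smallest integer t with rate**t <= tol; requires 0 < rate < 1 and 0 < tol < 1."""
    return math.ceil(math.log(tol) / math.log(rate))

well_conditioned = best_gd_rate(4.0, 1.0)      # kappa = 4
ill_conditioned = best_gd_rate(1.0e4, 1.0)     # kappa = 10^4
steps_well = iterations_to_tolerance(well_conditioned, 1e-6)
steps_ill = iterations_to_tolerance(ill_conditioned, 1e-6)
```

With a condition number of 4 the factor is 0.6 and a few dozen iterations suffice, whereas a condition number of $10^4$ pushes the factor to roughly 0.9998 and the iteration count into the tens of thousands.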

I-B Pre-Conditioning of DGD

It is possible, however, to improve the convergence rate of DGD beyond (4) by proper pre-conditioning of the gradients in (3), described as follows. Let $K$, referred to as the pre-conditioner, be a square matrix of size $n \times n$. The server now updates its estimate as follows:

$$x(t+1) = x(t) - \delta K \sum_{i=1}^{m} g^i(t). \tag{5}$$

If the matrix product $K A^T A$ is positive definite, then the convergence of (5) can be made linear by choosing $\delta$ appropriately, and the smallest possible convergence rate for (5) is given by [fessler2008image]

$$\rho_K = \frac{\kappa(K A^T A) - 1}{\kappa(K A^T A) + 1}. \tag{6}$$

However, most of the existing techniques for computing a matrix $K$ such that $\kappa(K A^T A)$ is provably smaller than $\kappa(A^T A)$, e.g. the incomplete LU factorization algorithms [meijerink1977iterative], accelerated iterative methods [axelsson1985survey], and the symmetric successive over-relaxation method [axelsson1985survey], require the server to have direct access to the matrices $A^i$. The parallelizable preconditioning methods [axelsson1985survey][benzi2002preconditioning] require $A$ to be a symmetric positive definite matrix. Therefore, the existing pre-conditioning techniques cannot be implemented in the considered distributed framework.

I-C Summary of Contributions

We propose a time-varying pre-conditioner matrix $K(t)$, instead of a constant pre-conditioner matrix $K$, for (5). That is, the server updates its estimate as follows:

(7)

The pre-conditioner can be computed in a distributed manner by the server in each iteration, as presented in Section II. We show that the iterative process (7) converges faster to the optimum point than the original DGD algorithm (3). In the experiments, we have also observed that the convergence speed of our proposed algorithm compares favourably to the accelerated projection-based consensus (APC) method proposed in [azizan2019distributed].

II Proposed Algorithm

In this section, we present our distributed pre-conditioning technique that provably improves the convergence speed of the distributed gradient descent method (3) for solving the distributed least squares problem (1).

To be able to present our pre-conditioning technique, we introduce the following notation.

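  • For a positive integer $m$, let $[m] := \{1, \ldots, m\}$.

  • Let $e_j$ denote the $j$-th column of the $(n \times n)$-dimensional identity matrix $I$.

  • Let $K(t)$ be the pre-conditioner matrix for iteration $t$, and let $k_j(t)$ denote the $j$-th column of $K(t)$.

  • For each machine $i \in \{1, \ldots, m\}$, each column index $j \in \{1, \ldots, n\}$ and each iteration $t$, define

    $$R^i_j(t) = \left( (A^i)^T A^i + \frac{\beta}{m} I \right) k_j(t) - \frac{1}{m} e_j. \tag{8}$$

The pre-conditioned DGD method is described below in Algorithm 1. Note that $\alpha$, $\delta$ and $\beta$ are positive-valued parameters in the algorithm.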

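1: Initialize $x(0) \in \mathbb{R}^n$, $K(0) \in \mathbb{R}^{n \times n}$
2: for $t = 0, 1, 2, \ldots$ do
3:     The server transmits $x(t)$ and $K(t)$ to all the machines
4:     for each machine $i \in \{1, \ldots, m\}$ do
5:         Compute $g^i(t)$ given by (2)
6:         for each $j \in \{1, \ldots, n\}$ do
7:              Compute $R^i_j(t)$ given by (8)
8:         end for
9:         Transmit $g^i(t)$ and $\{R^i_j(t)\}_{j=1}^{n}$ to the server
10:     end for
11:     for each $j \in \{1, \ldots, n\}$ do
12:         The server updates
$$k_j(t+1) = k_j(t) - \alpha \sum_{i=1}^{m} R^i_j(t) \tag{9}$$
13:     end for
14:     The server updates the estimate
$$x(t+1) = x(t) - \delta K(t+1) \sum_{i=1}^{m} g^i(t) \tag{10}$$
15: end for
Algorithm 1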

For each iteration $t$, the server maintains a pre-conditioner matrix $K(t)$ and an estimate $x(t)$ of the point of optimum $x^*$. The initial pre-conditioner matrix $K(0)$ and estimate $x(0)$ are chosen arbitrarily. The server sends $x(t)$ and $K(t)$ to the machines. Each machine $i$ computes $R^i_j(t)$ for $j = 1, \ldots, n$ as given in (8) and the local gradient $g^i(t)$ as given in (2), and sends these to the server. Then, the server computes the updated pre-conditioner $K(t+1)$ as given in (9), and uses this updated pre-conditioner to compute the updated estimate $x(t+1)$ as given by (10).
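The server and machine steps just described can be sketched in plain Python as follows. This is a hedged illustration, not the authors' implementation: it assumes the per-machine quantity in (8) has the form $\left((A^i)^T A^i + \frac{\beta}{m} I\right) k_j(t) - \frac{1}{m} e_j$, so that the sum over machines drives each column of the pre-conditioner toward the corresponding column of $(A^T A + \beta I)^{-1}$. All function names, the toy data and the parameter values (`alpha`, `delta`, `beta`) are hypothetical.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in M]

def rmatvec(M, v):
    """Multiply the transpose of M by a vector."""
    return [sum(M[k][j] * v[k] for k in range(len(M))) for j in range(len(M[0]))]

def preconditioned_dgd(machines, n, alpha, delta, beta, iterations):
    """Sketch of the iteratively pre-conditioned DGD loop (assumed form of (8))."""
    m = len(machines)
    x = [0.0] * n                            # x(0), chosen arbitrarily
    K_cols = [[0.0] * n for _ in range(n)]   # columns of K(0), chosen arbitrarily
    for _ in range(iterations):
        grad = [0.0] * n                     # sum of local gradients
        R = [[0.0] * n for _ in range(n)]    # R[j] = sum over machines of R_j^i
        for A_i, b_i in machines:            # work done on machine i
            residual = [p - q for p, q in zip(matvec(A_i, x), b_i)]
            grad = [s + g for s, g in zip(grad, rmatvec(A_i, residual))]
            for j in range(n):
                k_j = K_cols[j]
                gram_k = rmatvec(A_i, matvec(A_i, k_j))   # A_i^T A_i k_j
                for r in range(n):
                    e_jr = 1.0 if r == j else 0.0
                    R[j][r] += gram_k[r] + (beta / m) * k_j[r] - e_jr / m
        # Server: update every column of the pre-conditioner.
        K_cols = [[k - alpha * r for k, r in zip(K_cols[j], R[j])] for j in range(n)]
        # Server: x(t+1) = x(t) - delta * K(t+1) * grad, using K grad = sum_j grad[j] * k_j.
        step = [sum(K_cols[j][r] * grad[j] for j in range(n)) for r in range(n)]
        x = [xr - delta * s for xr, s in zip(x, step)]
    return x, K_cols

# Toy data (hypothetical): A^T A = diag(4, 1), minimizer (1, 3).
machines = [([[2.0, 0.0]], [2.0]), ([[0.0, 1.0]], [3.0])]
x_hat, K_cols = preconditioned_dgd(
    machines, n=2, alpha=0.38, delta=1.0, beta=0.1, iterations=200)
```

Note that each machine only ever applies its own $A^i$ to vectors it receives, so the server never needs direct access to the data. In this toy run the first column of the pre-conditioner settles near $(1/4.1, 0)$, the first column of $(A^T A + 0.1 I)^{-1}$.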

II-A Convergence Analysis

To be able to present the convergence of Algorithm 1, we introduce the following notation.

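  • Let

    $$K_\beta = \left( A^T A + \beta I \right)^{-1},$$

    where $I$ is the $(n \times n)$-dimensional identity matrix.

  • Let $k_{j\beta}$ be the $j$-th column of $K_\beta$, $j = 1, \ldots, n$.

We make the following assumption.
Assumption 1: Assume that the matrix $A^T A$ is full rank. Let $\gamma$ and $\lambda$ denote the smallest and largest eigenvalues of $A^T A$, respectively.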

Then,

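Lemma 1
Consider the iterative process (9), and let $\beta > 0$. Under Assumption 1, there exists $\alpha > 0$ for which there is a positive value $\rho_\beta < 1$ such that for each $j \in \{1, \ldots, n\}$,
$$\| k_j(t+1) - k_{j\beta} \| \le \rho_\beta \, \| k_j(t) - k_{j\beta} \|, \quad t = 0, 1, \ldots$$
Moreover, the smallest value of $\rho_\beta$ is given by
$$\frac{\lambda - \gamma}{\lambda + \gamma + 2\beta}.$$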
 
Proof of Lemma 1 is in Appendix A.

The above lemma shows that each column $k_j(t)$ of the pre-conditioner matrix $K(t)$ asymptotically converges to the corresponding column $k_{j\beta}$ of $K_\beta$. In other words, the matrix $K(t)$ asymptotically converges to $K_\beta$.

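Proposition 1
If Assumption 1 holds and $\beta > 0$, then $K_\beta A^T A$ is positive definite, and the largest and the smallest eigenvalues of $K_\beta A^T A$ are $\frac{\lambda}{\lambda + \beta}$ and $\frac{\gamma}{\gamma + \beta}$, respectively.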
 

Proof of Proposition 1 is deferred to Appendix B.
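As a rough numerical illustration of why such a pre-conditioner helps, the sketch below assumes the limiting pre-conditioner is $(A^T A + \beta I)^{-1}$, so that an eigenvalue $s$ of $A^T A$ is mapped to $s/(s + \beta)$. The function names and sample values are hypothetical.

```python
def preconditioned_extreme_eigenvalues(lam_max, lam_min, beta):
    """Largest and smallest eigenvalues of (A^T A + beta I)^{-1} A^T A (assumed form)."""
    return lam_max / (lam_max + beta), lam_min / (lam_min + beta)

def preconditioned_condition_number(lam_max, lam_min, beta):
    """Ratio of the two extreme eigenvalues after pre-conditioning."""
    largest, smallest = preconditioned_extreme_eigenvalues(lam_max, lam_min, beta)
    return largest / smallest

# Hypothetical ill-conditioned example: eigenvalues of A^T A between 1 and 10^4.
kappa_before = 1.0e4 / 1.0
kappa_after = preconditioned_condition_number(1.0e4, 1.0, beta=1.0)
```

Because $s \mapsto s/(s+\beta)$ is increasing and bounded by 1, the spread of the eigenvalues is compressed: in this example the condition number drops from $10^4$ to just under 2, and a smaller $\beta$ compresses it further.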

As shown in the next proposition, the smallest and the largest eigenvalues of $K_\beta A^T A$ have a direct influence on the convergence speed of Algorithm 1.

Proposition 2
Consider Algorithm 1 with . Under Assumption 1, there exists for which there is a positive value such that
where . Moreover, for any such that
and the smallest convergence rate of Algorithm 1 is , where .
 

Proof of Proposition 2 is deferred to Appendix C.

According to Proposition 2, the error bound for estimate evolves with and after a finite time the rate of convergence is bounded by a constant of value less than 1.

Let the estimate of the optimum point computed by Algorithm 1 and DGD be denoted by and , respectively, after iterations. Define, and . For , we denote the upper bound on relative error by which is given as follows.

In the following theorem, we present a formal comparison between the convergence speed of Algorithm 1 and the original DGD method (3).

Theorem 1
Consider Algorithm 1 and the DGD algorithm (3) with identical initial estimate . If Assumption 1 holds, then there exists such that for all .
 

Proof of Theorem 1 is deferred to Appendix D.

Theorem 1 implies that if the server executes both Algorithm 1 and the original DGD algorithm using the same initial estimate of $x^*$, then after a certain number of iterations the upper bound on the estimation error of Algorithm 1 is smaller than that of the original DGD.

Fig. 2: Temporal evolution of error norm for estimate for Example 1, under Algorithm 1 with different parameter choices and initialization. (a) ; (b) .

Fig. 3: Temporal evolution of error norm for estimate against number of iterations for Example 1, under Algorithm 1, DGD [yuan2016convergence] and APC [azizan2019distributed]; with (a) arbitrary parameter choices (b) optimal parameter choices. Initialization for (a) and (b) both: (Algorithm 1) , ; (DGD) ; (APC) according to the algorithm. In (a): (Algorithm 1) ; (DGD) ; (APC) . In (b): (Algorithm 1) ; (DGD) ; (APC) .

Fig. 4: Temporal evolution of error norm for estimate against time needed for Example 1, under Algorithm 1, DGD [yuan2016convergence] and APC [azizan2019distributed]; with (a) arbitrary parameter choices (b) optimal parameter choices. Initialization for (a) and (b) both: (Algorithm 1) , ; (DGD) ; (APC) according to the algorithm. In (a): (Algorithm 1) ; (DGD) ; (APC) . In (b): (Algorithm 1) ; (DGD) ; (APC) .

III Experiments

In this section, we present the experimental results to verify the obtained theoretical convergence guarantees of Algorithm 1. The matrix is loaded from one of the real data-sets in the SuiteSparse Matrix Collection (https://sparse.tamu.edu).

Example 1. The matrix is a real symmetric positive definite matrix from the benchmark dataset “bcsstm07” which is part of a structural problem. The dimension of is and its condition number is . We generate the vector by setting where is dimensional vector of ones. Since is positive definite, (1) has a unique solution . We split the data among machines, so that , .
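One simple way to distribute such a data-set is to give each machine a contiguous block of rows of the matrix together with the matching entries of the vector. The helper below is a hypothetical sketch of that row partitioning; the name `split_rows`, the even-split policy and the toy data are assumptions of this illustration, and the block sizes used in the experiment are not reproduced here.

```python
def split_rows(A, b, m):
    """Partition the rows of (A, b) into m contiguous blocks, one per machine."""
    if m <= 0 or m > len(A):
        raise ValueError("m must be between 1 and the number of rows")
    if len(A) != len(b):
        raise ValueError("A and b must have the same number of rows")
    base, extra = divmod(len(A), m)
    machines, start = [], 0
    for i in range(m):
        size = base + (1 if i < extra else 0)   # first `extra` blocks get one more row
        machines.append((A[start:start + size], b[start:start + size]))
        start += size
    return machines

# Toy data (hypothetical): five rows shared between two machines.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
parts = split_rows(A, b, m=2)
```

Stacking the blocks in order recovers the original pair, so the sum of the local least-squares costs equals the global cost.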

We apply Algorithm 1 to solve the aforementioned optimization problem. The speed of the algorithm depends on the choice of the algorithm parameters (ref. Fig. 2(a)). Also, the algorithm converges to $x^*$ irrespective of the initial choice of the entries in $x(0)$ and $K(0)$ (ref. Fig. 2(b)).

The experimental results have been compared with the conventional distributed gradient descent (DGD) [yuan2016convergence] method and the accelerated projection-based consensus (APC) [azizan2019distributed] method (ref. Figs. 3 and 4). We could not find any existing preconditioning technique for the distributed least squares problem (1) in the literature to compare with. We compare the time needed by these three algorithms to reach a relative estimation error (defined as ) of . The tuning parameter has been set at for Algorithm 1. The rest of the algorithm parameters have been set such that the respective algorithms have their smallest possible convergence rates. Specifically, for Algorithm 1, for APC and for DGD. We found that Algorithm 1 and APC respectively need and seconds to reach relative estimation error , while DGD takes seconds to attain relative error . The recorded times are averages over five runs of the corresponding algorithms. Note that evaluating the optimal tuning parameters for any of these algorithms requires knowledge of the smallest and largest eigenvalues of $A^T A$. Clearly, Algorithm 1 performs much faster than DGD and has a speed of convergence similar to APC. However, we have formally shown Algorithm 1 to be faster than the DGD algorithm, unlike the APC algorithm which has only been speculated to perform faster.

We notice that the error norm for Algorithm 1 is less than that of DGD from an approximate iteration index onward (ref. Fig. 3(b)). Since is the same for both algorithms, this observation is in agreement with the claim in Theorem 1. In the case of the optimal algorithm parameters, APC is faster than Algorithm 1 up to a certain time, seconds where the relative error is approximately , but after that the error of Algorithm 1 decreases faster (ref. Fig. 4(b)).

IV Discussions

In this paper, we proposed an algorithm that distributively solves linear least-squares minimization problems over a server-machine network architecture. The novelty of the proposed method lies in incorporating a time-varying pre-conditioning technique into the classical distributed gradient descent method in a way that has been theoretically shown to boost convergence speed. The computation of the pre-conditioner is done at the server level without requiring any access to the data. In practice, we test the algorithm on a real data-set and get much better performance compared to the classical distributed gradient algorithm regarding the time needed for convergence to the true solution. Although an existing accelerated algorithm performs similarly to the proposed method on this problem regarding the speed of convergence, it is not guaranteed theoretically to do so. We, however, have formally shown the proposed algorithm to converge faster than the distributed gradient method.

References

Appendix

Appendix A: Proof of Lemma 1

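From (8) and noticing that $\sum_{i=1}^{m} (A^i)^T A^i = A^T A$, dynamics (9) can be rewritten as

$$k_j(t+1) = k_j(t) - \alpha \left[ \left( A^T A + \beta I \right) k_j(t) - e_j \right]. \tag{11}$$

Define $\tilde{k}_j(t) = k_j(t) - k_{j\beta}$. Since $\left( A^T A + \beta I \right) k_{j\beta} = e_j$, from (11),

$$\tilde{k}_j(t+1) = \left[ I - \alpha \left( A^T A + \beta I \right) \right] \tilde{k}_j(t). \tag{12}$$

Since $\left( A^T A + \beta I \right)$ is positive definite for $\beta > 0$, there exists $\alpha$ for which there is a positive $\rho_\beta < 1$ such that $\| \tilde{k}_j(t+1) \| \le \rho_\beta \| \tilde{k}_j(t) \|$ for all $t$, where the smallest value of $\rho_\beta$ is $\frac{(\lambda + \beta) - (\gamma + \beta)}{(\lambda + \beta) + (\gamma + \beta)}$ [fessler2008image]. As $\tilde{k}_j(t) = k_j(t) - k_{j\beta}$, the claim follows.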

Appendix B: Proof of Proposition 1

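Consider the eigendecomposition $A^T A = V \Lambda V^T$, where $V$ is the eigenvector matrix of $A^T A$ and $\Lambda$ is a diagonal matrix with the eigenvalues of $A^T A$ on the diagonal. Since $K_\beta = \left( A^T A + \beta I \right)^{-1}$ and the identity matrix can be written as $I = V V^T$, we have

$$K_\beta A^T A = V \left( \Lambda + \beta I \right)^{-1} \Lambda V^T = V \, \mathrm{diag}\!\left( \frac{\lambda_1}{\lambda_1 + \beta}, \ldots, \frac{\lambda_n}{\lambda_n + \beta} \right) V^T,$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of the positive definite matrix $A^T A$. Since $\beta > 0$, the claim follows.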

Appendix C: Proof of Proposition 2

Using (2) and observing that , dynamics (10) can be rewritten as

(13)

Define . From (13),

(14)

where comprises the columns , and . From Lemma 1, it follows that

(15)

Now,

(16)

where the last inequality follows from (15). From (14) and (16),

(17)

Since is positive definite for (from Proposition 1), for which and (17) holds. The smallest value of is .

Pick any . As for any , for any such that . Then

(18)

Then, with the largest and the smallest eigenvalues of $K_\beta A^T A$ obtained in Proposition 1, the claim follows.

Appendix D: Proof of Theorem 1

From Proposition 2, for any such that , . Since this holds for any , we can say that such that ,

(19)

where depends only on and independent of . The inequality on the right hand side follows from the first statement in Proposition 2. For DGD, we have ,

(20)

It can be easily checked that for . From (19) and (20), we have ,

where and . Since and do not depend on and , such that . This completes the proof.