 This paper proposes a new gradient method to solve the large-scale problems. Theoretical analysis shows that the new method has finite termination property for two dimensions and converges R-linearly for any dimensions. Experimental results illustrate first the issue of parallel implementation. Then, the solution of a large-scale problem shows that the new method is better than the others, even competitive with the conjugate gradient method.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We are interested in investigating new gradient methods for the solution of linear system

 Ax=b, (1)

where is symmetric positive definite (SPD) and . This problem is equivalent to the minimization of a convex quadratic function

 f(x)=12x⊺Ax−b⊺x. (2)

Gradient methods generate a sequence of the form

 xk+1=xk−αkgk,k=0,1,…, (3)

where . It is well known that the steepest descent (SD) method  performs poor in most cases, where the steplength can be written as follows

 αSDk=g⊺kgkg⊺kAgk. (4)

The iterates generated tend to asymptotically alternate between two directions . In contrast, the conjugate gradient (CG) method  is often the method of choice that will terminate in at most iterations. It is very attractive because of its high efficiency and low storage requirement. Nonetheless, CG iteration depends strongly on the search of direction calculation, i.e., any derivation such as round-off errors can seriously degrade performance .

In the past several decades, a renewed interest for gradient methods has appeared since Barzilai and Borwein  proposed two efficient nonmonotone steplengths

 αBB1k=g⊺k−1gk−1g⊺k−1Agk−1. (5)
 αBB2k=g⊺k−1Agk−1g⊺k−1A2gk−1. (6)

The motivation consists in approximating the Hessian and imposing some quasi-Newton properties. Some theories and experiments have shown that BB methods have good performance and are competitive with CG methods when low accuracy is required or small perturbation exists . The convergence has been proven by Raydan . Furthermore, Friedlander et al.  provided a general framework under the name of “gradient method with retards” that SD and BB both belong to it, as well as several alternate methods proposed later [4, 5, 21].

Motivated by the two-dimensional finite termination property, Yuan  provided a somewhat complicated steplength

 αYk=2 ⎷(1αSDk−1−1αSDk)2+4g⊺kgks⊺k−1sk−1+1αSDk−1+1αSDk, (7)

where . He gave two algorithms and some variants were investigated further by Dai and Yuan . Among these methods, the second variant (DY) is the most efficient one according to the experiments in , where iterates are generated of the form

 αDYk={αSDk,kmod4<2,αYk,otherwise. (8)

In this paper, we address the properties of cyclic gradient methods, especially their parallel behavior. We propose a new algorithm based on the Yuan steplength, which has also the two-dimensional finite termination property. In the next section, we introduce the cyclic gradient methods and propose our new steplength. In Section 3, we give the convergence results of the new method. Some numerical results are presented in Section 4. Finally, a concluding remark is shown in Section 5.

Friedlander et al.  proposed an ingenious framework that gives rise to a great number of potentially efficient algorithms. Firstly, assume that represents retard that allows to employ the information from previous iterations. Let

 ¯k=max{0, k−m}, (9)

then a collection of possible choices of steplength can be set as follows

 αGMRk=g⊺τ(k)Aρ(k)gτ(k)g⊺τ(k)Aρ(k)+1gτ(k), (10)

where

 τ(k)∈{¯k, ¯k+1, …, k−1, k}, (11)

and

 ρ(k)∈{q1, …, qm},qj≥0, (12)

where . The next theorem summarizes the convergence result in .

###### Theorem 1 (Friedlander et al., 1999).

Consider the linear system (1) with is SPD and , where is the exact solution. Consider the gradient method (3) being used to solve (1) and the steplength given by (10). Then the sequence converges to starting from any point .

For a proof of the above theorem, see . Incidentally, several potential algorithms were provided therein, including the first cyclic gradient method under the name of cyclic steepest descent (CSD) as suggested in , which can be summarized as follows

 αCSDk={αSDk,kmodm=0,αk−1,otherwise. (13)

Notice that if we choose and , then (10) becomes CSD method, which satisfies the Theorem 1. On the other hand, Dai  proposed a variant called cyclic Barzilai-Borwein (CBB) method. They suggested that

 αCBBk={αBB1k,kmodm=0,αk−1,otherwise. (14)

Similarly, if we choose and , then (10) becomes CBB method.

Although these methods greatly speed up the convergence, their motivation is too straightforward to further accelerate the iterations, which relies on the nonmonotone property to search the whole space without sink into any lower subspace spanned by eigenvectors

. This allows to reduce the gradient components more or less in the same asymptotic rate .

The recent literature showed that Yuan steplength may lead to efficient algorithms [20, 6]. All methods therein have two-dimensional finite termination property, i.e., if (8) is applied to a linear system in two-dimensional space, then the algorithm will terminate in at most 3 iterations. In general, such property seems not attractive in practice. However, experiments showed that they perform well in higher dimensions and are competitive with BB methods for large-scale problems .

Inspired by the Yuan steplength, we suggest a simple way of modifying steepest descent model to a cyclic gradient method. Consider a steplength of the form

 αYBk={αSDk,kmod3=0 % or 2,αYk,kmod3=1. (15)

Here we modify the order of SD and Y compared to the original YB formula, which is useful for the development of the new algorithm. Apart from this change, (15) is indeed the second algorithm propose by the pioneering work of Yuan . It keeps the two-dimensional finite termination property that performs as well as BB for large-scale problems and better for small-scale problems. We could introduce simply the cyclic behavior based on (15) of the form

 ∀m∈N, if kmod(3+m)>2, then αk=αk−1. (16)

Besides, we find that De Asmundis et al.  gives an interesting view about the iterations of SD method, where the technique of alignment was proposed therein to force the gradients into one-dimensional subspace and avoid the zigzag pattern. Notice that the inverse of constant Rayleigh quotient such as SD and BB steplengths has the similar property. Thus, constant SD with retards can also give rise to the alignment behavior and keep the nonmonotone benefit. To achieve this goal, we need to impose a repeat time to the zigzag process. Meanwhile, we want to keep the process based on Yuan steplength in the first several iterations. These motivations lead to a new method of the form

 αCYk=⎧⎪ ⎪⎨⎪ ⎪⎩αYk,kmod(l+m+2)=1,αSDk,kmod(l+m+2)

where and . Such formula seems complicated, but indeed easy to understand. There are three components consisting in (17): the first SD and Y are used to insure the finite termination property; the parameter acting on the second part of SD is used to keep several zigzag iterations; finally, the retard term induces alignment and provides nonmonotone behavior to leap from the lower subspace.

## 3 Convergence Analysis

By the invariance property under any orthogonal transformation, we can assume without loss of generality that

 A=diag(λ1, …, λn), (18)

where

 1=λ1≤⋯≤λn. (19)

We follow the convergence framework established by Dai  and adapt it to our method. Let

 G(k,μ)=μ∑i=1g2i,k, (20)

where is the th component of . A preliminary property is defined as follows.

###### Definition 2 (Property A).

Suppose that matrix has the form (18) with condition (19) holds. If , , such that , , ,

• ;

• if and , then ,

then the steplength has Property A.

The convergence framework of Dai can be deduced from Property A, stated as follows.

###### Theorem 3 (Dai, 2003).

Consider the linear system (1) with of the form (18) and . Consider the gradient method (3) being used to solve (1). If the steplength has Property A, then the sequence converges to 0 R-linearly for any starting point .

For a proof of the above theorem, see . Many gradient methods have Property A as mentioned in , e.g., the gradient method with retards (10). Inspired by the demonstration therein, we now develop a convergence result for the CY method.

###### Theorem 4.

Consider the linear system (1) with of the form (18) and . Consider the gradient method (3) with steplength (17) being used to solve (1). Then the steplength has Property A.

###### Proof.

Note that (17) has three alternate steplengths, whereas the SD updating process and the constant process using the last SD steplength both follow the framework (10), which has been proven to have the Property A . Therefore, we only investigate the Yuan steplength.

Recall that Yuan steplength has the following property

 ⎛⎝1αSDk−1+1αSDk⎞⎠−1<αYk

which is given in . Hence,

 λ1≤1αSDk<1αYk<1αSDk−1+1αSDk≤2λn. (22)

Then the first condition of Property A holds by setting . For the second one, let and , which yields . Suppose that

 G(k,μ)≤ϵ,g2μ+1,k≥M2ϵ, (23)

for all , and . Hence, the inverse of Yuan steplength becomes

 1αYk>1αSDk=g⊺kAgkg⊺kgk=∑ni=1λig2i,k∑ni=1g2i,k≥λμ+1∑ni=μ+1g2i,k∑μi=1g2i,k+∑ni=μ+1g2i,k≥λμ+1∑μi=1g2i,kg2μ+1,k+1≥λμ+1ϵ2ϵ+1=23λμ+1 (24)

Hence, the second condition of Property A is satisfied, which completes the proof. ∎

## 4 Numerical Results

We first address the issue of parallel implementation. The dot product is engaged in the computation of steplength, which is the major obstacle of parallelization. Here we have two strategies to realize this goal. Let be the band matrix stored in the

th processor. The first one (Gather Algorithm, GA) is to gather the vector

and execute dot product with global vectors, shown as follows

Then, the second one (Reduce Algorithm, RA) consists in computing the dot product locally, shown as follows

Besides, we can see that global gradient vector is used in each iteration that we must proceed another Allgatherv function to communicate with other processor.

Let be the number of processors. The two experiments are proceeded by Alinea  (see also, e.g., [17, 18, 16]) and JACK [13, 15] (see also, e.g., ) and results are illustrated in Figures 1 and 2. We can see that generally the results are not good because the first one imposes so much computation and communication load, while the second one causes indeed the problem of loss of precision. These problems exist in all projection methods and by now we have not yet managed to find a solution.

The second experiments are proceeded by Matlab R2017b with a large-scale problem provided by The SuiteSparse Matrix Collection , with and non-zero values. All parameters are chosen under a training problem given in , such that for CY, for CSD, and for CBB. Average results are shown in Table 1.

We use bold numbers indicating the most efficient algorithms under each residual threshold. Backslash represents a number of iterations more than . From Table 1, we see that the CY method performs better than other methods except CG. Although we do not yet beat CG in high precision, our method is still competitive in most cases. Specifically, we can see that CY is stable throughout the iterations and much better than CG when low accuracy is required.

## 5 Conclusions

In this paper, we have proposed a new gradient method and shown that it is very competitive with CG method and better than the others for large-scale problems. However, it is still lack of theoretical evidence supporting such results. It is also important to find a better parallelization strategy. Therefore it still remains to study the properties and high performance implementation of gradient methods.

## Acknowledgment

This work was supported by the French national programme LEFE/INSU and the project ADOM (Méthodes de décomposition de domaine asynchrones) of the French National Research Agency (ANR).

## References

•  H. Akaike.

On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method.

Annals of the Institute of Statistical Mathematics, 11(1):1–16, 1959.
•  J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
•  A. L. Cauchy. Méthode générale pour la résolution des systèmes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1):536–538, 1847.
•  Y.-H. Dai. Alternate step gradient method. Optimization, 52(4-5):395–415, 2003.
•  Y.-H. Dai and Y.-X. Yuan. Alternate minimization gradient method. IMA Journal of Numerical Analysis, 23(3):377–393, 2003.
•  Y.-H. Dai and Y.-X. Yuan. Analysis of monotone gradient methods. Journal of Industrial and Management Optimization, 1(2):181–192, 2005.
•  T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1:1–1:25, 2011.
•  R. De Asmundis, D. di Serafino, F. Riccio, and G. Toraldo. On spectral properties of steepest descent methods. IMA Journal of Numerical Analysis, 33(4):1416–1435, 2013.
•  R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors, Optimization and Control with Applications, pages 235–256. Springer US, Boston, MA, 2005.
•  A. Friedlander, J. M. Martínez, B. Molina, and M. Raydan. Gradient method with retards and generalizations. SIAM Journal on Numerical Analysis, 36(1):275–289, 1999.
•  M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
•  F. Magoulès and A.-K. Cheik Ahamed. Alinea: An advanced linear algebra library for massively parallel computations on graphics processing units. The International Journal of High Performance Computing Applications, 29(3):284–310, 2015.
•  F. Magoulès and G. Gbikpi-Benissan. JACK: an asynchronous communication kernel library for iterative algorithms. The Journal of Supercomputing, 73(8):3468–3487, 2017.
•  F. Magoulès and G. Gbikpi-Benissan. Distributed convergence detection based on global residual error under asynchronous iterations. IEEE Transactions on Parallel and Distributed Systems, 29(4):819–829, 2018.
•  F. Magoulès and G. Gbikpi-Benissan. JACK2: An MPI-based communication library with non-blocking synchronization for asynchronous iterations. Advances in Engineering Software, 119:116–133, 2018.
•  F. Magoulès, G. Gbikpi-Benissan, and Q. Zou. Asynchronous iterations of Parareal algorithm for option pricing models. Mathematics, 6(4):1–18, 2018.
•  F. Magoulès, D. B. Szyld, and C. Venet. Asynchronous optimized Schwarz methods with and without overlap. Numerische Mathematik, 137(1):199–227, 2017.
•  F. Magoulès and C. Venet. Asynchronous iterative sub-structuring methods. Mathematics and Computers in Simulation, 145:34–49, 2018.
•  M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321–326, 1993.
•  Y.-X. Yuan. A new stepsize for the steepest descent method. Journal of Computational Mathematics, 24(2):149–156, 2006.
•  B. Zhou, L. Gao, and Y.-H. Dai. Gradient methods with adaptive step-sizes. Computational Optimization and Applications, 35(1):69–86, 2006.