1 Introduction
We are interested in investigating new gradient methods for the solution of the linear system
$$Ax = b, \qquad (1)$$
where $A \in \mathbb{R}^{n \times n}$ is symmetric positive definite (SPD) and $b \in \mathbb{R}^{n}$. This problem is equivalent to the minimization of the convex quadratic function
$$f(x) = \tfrac{1}{2}\, x^{T} A x - b^{T} x. \qquad (2)$$
Gradient methods generate a sequence of the form
$$x_{k+1} = x_k - \alpha_k g_k, \qquad (3)$$
where $g_k = \nabla f(x_k) = A x_k - b$ and $\alpha_k > 0$ is the steplength. It is well known that the steepest descent (SD) method [3], whose steplength can be written as
$$\alpha_k^{SD} = \frac{g_k^{T} g_k}{g_k^{T} A g_k}, \qquad (4)$$
performs poorly in most cases.
The iterates generated tend to asymptotically alternate between two directions [1]. In contrast, the conjugate gradient (CG) method [11] is often the method of choice, since it terminates in at most $n$ iterations. It is very attractive because of its high efficiency and low storage requirements. Nonetheless, the CG iteration depends strongly on the search direction calculation, i.e., any deviation such as round-off error can seriously degrade its performance [9].
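For concreteness, here is a minimal sketch of iteration (3) with the SD steplength (4) for a small SPD system; the test matrix, right-hand side, tolerance, and function name are illustrative choices of ours, not taken from the paper.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=10000):
    """Gradient iteration x_{k+1} = x_k - alpha_k g_k with the SD steplength (4)."""
    x = x0.copy()
    g = A @ x - b                   # gradient of f(x) = 0.5 x^T A x - b^T x
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        Ag = A @ g
        alpha = (g @ g) / (g @ Ag)  # alpha_k^SD = g_k^T g_k / g_k^T A g_k
        x -= alpha * g
        g -= alpha * Ag             # g_{k+1} = g_k - alpha_k A g_k (saves one product)
    return x, k

# Illustrative ill-conditioned SPD system (not from the paper).
A = np.diag(np.linspace(1.0, 100.0, 50))
b = np.ones(50)
x, iters = steepest_descent(A, b, np.zeros(50))
```

Monitoring successive gradients of this sketch exhibits the two-direction zigzag mentioned above.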
In the past several decades, renewed interest in gradient methods has appeared since Barzilai and Borwein [2] proposed two efficient nonmonotone steplengths,
$$\alpha_k^{BB1} = \frac{s_{k-1}^{T} s_{k-1}}{s_{k-1}^{T} y_{k-1}} \qquad (5)$$
and
$$\alpha_k^{BB2} = \frac{s_{k-1}^{T} y_{k-1}}{y_{k-1}^{T} y_{k-1}}, \qquad (6)$$
where $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$.
The motivation consists in approximating the Hessian and imposing a quasi-Newton property. Theory and experiments have shown that the BB methods perform well and are competitive with CG methods when low accuracy is required or small perturbations exist [9]. Their convergence for convex quadratics was proven by Raydan [19]. Furthermore, Friedlander et al. [10] provided a general framework under the name of “gradient method with retards”, to which both SD and BB belong, as do several alternate methods proposed later [4, 5, 21].
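A minimal sketch of the nonmonotone BB iteration in the same setting follows; the first steplength and the single-variant update are illustrative choices, not prescriptions from [2].

```python
import numpy as np

def barzilai_borwein(A, b, x0, tol=1e-8, max_iter=10000, variant="BB1"):
    """Gradient iteration (3) with the Barzilai-Borwein steplengths (5)-(6)."""
    x = x0.copy()
    g = A @ x - b
    alpha = 1.0 / np.max(np.abs(np.diag(A)))  # illustrative first steplength
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        x_new = x - alpha * g
        g_new = A @ x_new - b
        s, y = x_new - x, g_new - g           # s_{k-1} and y_{k-1} of (5)-(6)
        alpha = (s @ s) / (s @ y) if variant == "BB1" else (s @ y) / (y @ y)
        x, g = x_new, g_new
    return x, k
```

Unlike SD, the residual norm of this iteration is typically nonmonotone, which is precisely the behavior exploited by the cyclic methods discussed below.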
Motivated by the two-dimensional finite termination property, Yuan [20] provided a somewhat complicated steplength,
$$\alpha_k^{Y} = \frac{2}{\sqrt{\left(1/\alpha_{k-1}^{SD} - 1/\alpha_k^{SD}\right)^{2} + 4\|g_k\|^{2}/\left(\alpha_{k-1}^{SD}\|g_{k-1}\|\right)^{2}} + 1/\alpha_{k-1}^{SD} + 1/\alpha_k^{SD}}, \qquad (7)$$
where $\alpha_k^{SD}$ denotes the SD steplength (4). He gave two algorithms, and some variants were investigated further by Dai and Yuan [6]. Among these methods, the second variant (DY) is the most efficient one according to the experiments in [6]; its iterates are generated with the steplength
$$\alpha_k^{DY} = \begin{cases} \alpha_k^{SD}, & \text{if } \operatorname{mod}(k,4) < 2, \\ \alpha_k^{Y}, & \text{otherwise}. \end{cases} \qquad (8)$$
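For reference, the Yuan steplength can be evaluated from two consecutive SD steplengths and gradient norms. The small helper below is a direct transcription of expression (7) above and is given only as a sketch; the argument names are ours.

```python
import numpy as np

def yuan_steplength(alpha_sd_prev, alpha_sd_curr, g_norm_prev, g_norm_curr):
    """Yuan steplength (7) from alpha_{k-1}^SD, alpha_k^SD, ||g_{k-1}||, ||g_k||."""
    a = 1.0 / alpha_sd_prev
    b = 1.0 / alpha_sd_curr
    c = 2.0 * g_norm_curr / (alpha_sd_prev * g_norm_prev)
    return 2.0 / (np.sqrt((a - b) ** 2 + c ** 2) + a + b)
```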
In this paper, we address the properties of cyclic gradient methods, especially their parallel behavior. We propose a new algorithm based on the Yuan steplength, which also has the two-dimensional finite termination property. In the next section, we introduce cyclic gradient methods and propose our new steplength. In Section 3, we give convergence results for the new method. Numerical results are presented in Section 4. Finally, concluding remarks are given in Section 5.
2 Cyclic Gradient Methods
Friedlander et al. [10] proposed an ingenious framework that gives rise to a great number of potentially efficient algorithms. First, assume that the integer $m \geq 1$ represents the retard, which allows the method to employ information from previous iterations. Let
$$x_{k+1} = x_k - \alpha_k g_k, \qquad g_k = A x_k - b; \qquad (9)$$
then a collection of possible choices of steplength can be set as follows:
$$\alpha_k = \frac{g_{\nu(k)}^{T} A^{\rho(k)-1} g_{\nu(k)}}{g_{\nu(k)}^{T} A^{\rho(k)} g_{\nu(k)}}, \qquad (10)$$
where
$$\nu(k) \in \{k,\; k-1,\; \ldots,\; \max\{0,\; k-m\}\} \qquad (11)$$
and
$$\rho(k) \in \{q_1,\; q_2,\; \ldots,\; q_{\hat m}\}, \qquad (12)$$
where $\hat m$ is a given positive integer and $q_j \geq 1$ for $j = 1, \ldots, \hat m$. The next theorem summarizes the convergence result in [10].
Theorem 1 (Friedlander et al., 1999).
Let $A$ be SPD and let $\{x_k\}$ be generated by (9) with any steplength of the form (10)–(12). Then either $g_k = 0$ for some finite $k$, or the sequence $\{x_k\}$ converges to the unique solution $x^{*}$ of (1).
For a proof of the above theorem, see [10]. Several potential algorithms were also provided therein, including the first cyclic gradient method, under the name of cyclic steepest descent (CSD) as suggested in [4], which can be summarized as follows:
$$\alpha_k^{CSD} = \alpha_{m\lfloor k/m\rfloor}^{SD}, \qquad (13)$$
i.e., the SD steplength computed at the beginning of each cycle of length $m$ is reused throughout the cycle.
Notice that if we choose $\nu(k) = m\lfloor k/m\rfloor$ and $\rho(k) \equiv 1$, then (10) becomes the CSD method, which satisfies Theorem 1. On the other hand, Dai [4] proposed a variant called the cyclic Barzilai-Borwein (CBB) method. He suggested that
$$\alpha_k^{CBB} = \alpha_{m\lfloor k/m\rfloor}^{BB1}. \qquad (14)$$
Similarly, if we choose $\nu(k) = m\lfloor k/m\rfloor - 1$ and $\rho(k) \equiv 1$, then (10) becomes the CBB method.
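To illustrate how the choices of $\nu(k)$ and $\rho(k)$ in (10) generate cyclic methods, the sketch below recomputes a retarded steplength at the start of each cycle and reuses it; the cycle length $m=4$, the option names, and the first-step handling are illustrative choices of ours.

```python
import numpy as np

def retarded_steplength(A, g_nu, rho=1):
    """Steplength (10): (g_nu^T A^{rho-1} g_nu) / (g_nu^T A^{rho} g_nu)."""
    v = g_nu.copy()
    for _ in range(rho - 1):
        v = A @ v                  # v = A^{rho-1} g_nu
    return (g_nu @ v) / (g_nu @ (A @ v))

def cyclic_gradient(A, b, x0, m=4, use_retard=False, tol=1e-8, max_iter=10000):
    """CSD-style iteration (13): refresh the steplength every m steps and reuse it.
    With use_retard=True the previous gradient g_{k-1} is used at the refresh,
    which mimics the CBB choice (14) for quadratics."""
    x = x0.copy()
    g = A @ x - b
    g_prev = g.copy()
    alpha = retarded_steplength(A, g)
    for k in range(max_iter):
        if np.linalg.norm(g) <= tol:
            break
        if k % m == 0:             # start of a new cycle: refresh the steplength
            g_nu = g_prev if (use_retard and k > 0) else g
            alpha = retarded_steplength(A, g_nu)
        g_prev = g.copy()
        x -= alpha * g
        g = A @ x - b
    return x, k
```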
Although these methods greatly speed up the convergence, their motivation is too straightforward to allow further acceleration: they rely on the nonmonotone property to sweep the whole space without sinking into any lower subspace spanned by eigenvectors [9], which reduces the gradient components at more or less the same asymptotic rate [6]. The recent literature has shown that the Yuan steplength may lead to efficient algorithms [20, 6]. All methods therein have the two-dimensional finite termination property, i.e., if (8) is applied to a linear system in a two-dimensional space, then the algorithm terminates in at most three iterations. In general, such a property does not seem attractive in practice. However, experiments have shown that these methods perform well in higher dimensions and are competitive with BB methods for large-scale problems [6].
Inspired by the Yuan steplength, we suggest a simple way of modifying the steepest descent model into a cyclic gradient method. Consider a steplength of the form
(15)
Here we modify the order of SD and Y compared to the original YB formula, which is useful for the development of the new algorithm. Apart from this change, (15) is indeed the second algorithm proposed in the pioneering work of Yuan [20]. It keeps the two-dimensional finite termination property, performs as well as BB for large-scale problems, and performs better for small-scale problems. We can simply introduce the cyclic behavior based on (15) in the form
(16)
Besides, we find that De Asmundis et al. [8] give an interesting view of the iterations of the SD method: a technique of alignment is proposed therein to force the gradients into a one-dimensional subspace and avoid the zigzag pattern. Notice that keeping the inverse Rayleigh quotient constant, as in the SD and BB steplengths, has a similar property. Thus, a constant SD steplength with retards can also give rise to the alignment behavior while keeping the nonmonotone benefit. To achieve this goal, we need to impose a repeat time on the zigzag process. Meanwhile, we want to keep the process based on the Yuan steplength in the first several iterations. These motivations lead to a new method of the form
(17)
where the two parameters (the repeat time and the retard) are positive integers. Such a formula seems complicated, but it is easy to understand. Three components make up (17): the first SD and Y steplengths are used to ensure the finite termination property; the parameter acting on the second SD part is used to keep several zigzag iterations; finally, the retard term induces alignment and provides the nonmonotone behavior needed to leap out of lower subspaces.
3 Convergence Analysis
By the invariance of gradient methods under orthogonal transformations, we can assume without loss of generality that
$$A = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \qquad (18)$$
where
$$0 < \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n. \qquad (19)$$
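This reduction can be checked numerically: with $A = Q \Lambda Q^{T}$, running a gradient method on $(A, b)$ and on $(\Lambda, Q^{T}b)$ produces iterates related by $\bar{x}_k = Q^{T}x_k$. The small check below reuses the `steepest_descent` sketch from Section 1 on a random SPD matrix; it is an illustration only, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # a random SPD matrix
b = rng.standard_normal(n)

lam, Q = np.linalg.eigh(A)         # A = Q diag(lam) Q^T
A_diag = np.diag(lam)              # the transformed, diagonal matrix (18)
b_diag = Q.T @ b

x, _ = steepest_descent(A, b, np.zeros(n))
x_diag, _ = steepest_descent(A_diag, b_diag, np.zeros(n))
# The solutions agree up to the orthogonal change of variables.
assert np.allclose(Q.T @ x, x_diag, atol=1e-6)
```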
We follow the convergence framework established by Dai [4] and adapt it to our method. Let
$$G(k, l) = \sum_{i=1}^{l} \left(g_k^{(i)}\right)^{2}, \qquad (20)$$
where $g_k^{(i)}$ is the $i$th component of $g_k$. A preliminary property is defined as follows.
Definition 2 (Property A).
The steplength $\alpha_k$ is said to have Property A if there exist an integer $m$ and positive constants $M_1 \geq \lambda_1$ and $M_2$ such that
(i) $\lambda_1 \leq 1/\alpha_k \leq M_1$ for all $k$;
(ii) for any integer $l \in [1, n-1]$ and any $\epsilon > 0$, if $G(k-j, l) \leq \epsilon$ and $\left(g_{k-j}^{(l+1)}\right)^{2} \geq M_2\,\epsilon$ hold for $j \in [0, \min\{k, m\} - 1]$, then $1/\alpha_k \geq \tfrac{2}{3}\,\lambda_{l+1}$.
The convergence framework of Dai can be deduced from Property A, stated as follows.
Theorem 3 (Dai, 2003).
Assume that $A$ has the form (18)–(19). If the steplength $\alpha_k$ has Property A, then either $g_k = 0$ for some finite $k$, or the sequence $\{\|g_k\|\}$ converges to zero.
For a proof of the above theorem, see [4]. Many gradient methods have Property A, as mentioned in [4], e.g., the gradient method with retards (10). Inspired by the proof therein, we now develop a convergence result for the CY method.
Theorem 4.
The steplength defined by (17) has Property A, and hence the CY method converges in the sense of Theorem 3.
Proof.
Note that (17) alternates between three steplengths. The SD updating process and the constant process that reuses the last SD steplength both fit the framework (10), which has been proven to have Property A [4]. Therefore, we only need to investigate the Yuan steplength.
Recall that the Yuan steplength has the following property,
$$\alpha_k^{Y} \leq \min\left\{\alpha_{k-1}^{SD},\; \alpha_k^{SD}\right\}, \qquad (21)$$
which is given in [20]. Hence,
(22)
Then the first condition of Property A holds by setting . For the second one, let and , which yields . Suppose that
(23)
for all , and . Hence, the inverse of the Yuan steplength becomes
(24)
Hence, the second condition of Property A is satisfied, which completes the proof. ∎
4 Numerical Results
We first address the issue of parallel implementation. The dot products involved in the computation of the steplength are the major obstacle to parallelization. Here we have two strategies to realize this goal. Assume that each processor stores a band (block of rows) of the matrix and the corresponding block of each vector. The first strategy (Gather Algorithm, GA) is to gather the distributed vector and execute the dot products with global vectors. The second strategy (Reduce Algorithm, RA) consists in computing the dot products locally and reducing the partial results.
Besides, the global gradient vector is used in each iteration, so an additional Allgatherv operation is needed to communicate it among the processors. Both strategies are sketched below.
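A minimal mpi4py sketch of the two strategies follows, assuming a row-band distribution of $A$ and of the vectors; all names (`dot_gather`, `dot_reduce`, `counts`, `displs`) are ours, and the communication pattern is only schematic of GA and RA, not the actual Alinea or JACK implementation.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def dot_gather(u_local, v_local, counts, displs):
    """Gather Algorithm (GA): assemble the global vectors, then take the dot product."""
    n = int(np.sum(counts))
    u = np.empty(n)
    v = np.empty(n)
    comm.Allgatherv(u_local, [u, counts, displs, MPI.DOUBLE])
    comm.Allgatherv(v_local, [v, counts, displs, MPI.DOUBLE])
    return u @ v                    # each rank redundantly computes the global dot product

def dot_reduce(u_local, v_local):
    """Reduce Algorithm (RA): local dot products summed by an all-reduce."""
    return comm.allreduce(u_local @ v_local, op=MPI.SUM)

def sd_steplength_parallel(A_band, g_local, counts, displs):
    """SD steplength (4) with a row-band matrix: forming A g needs the full gradient."""
    n = int(np.sum(counts))
    g = np.empty(n)
    comm.Allgatherv(g_local, [g, counts, displs, MPI.DOUBLE])  # global gradient
    Ag_local = A_band @ g           # local rows of A times the full gradient
    return dot_reduce(g_local, g_local) / dot_reduce(g_local, Ag_local)
```

GA forms full vectors before every dot product, which is consistent with the computation and communication load reported below, whereas RA reduces only a scalar but sums the partial products in a different order than the sequential code, which is one plausible source of the precision loss mentioned.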
We vary the number of processors in the experiments. The two experiments are carried out with Alinea [12] (see also, e.g., [17, 18, 16]) and JACK [13, 15] (see also, e.g., [14]), and the results are illustrated in Figures 1 and 2. We can see that the results are generally not good: the first strategy imposes a heavy computation and communication load, while the second one suffers from a loss of precision. These problems exist in all projection-type methods, and so far we have not managed to find a satisfactory solution.
The second set of experiments is carried out in Matlab R2017b with a large-scale problem provided by the SuiteSparse Matrix Collection [7]. All parameters for CY, CSD, and CBB are chosen using a training problem given in [2]. Average results are shown in Table 1.
Method |  |  |  |  |  |  
---|---|---|---|---|---|---
CG | 4251 | 5786 | 7535 |  |  |  
CY | 13 | 208 | 1153 | 4275 | 6181 | \
CSD | 1470 | \ |  |  |  |  
CBB | \ | \ |  |  |  |  
DY | 15 | 200 | \ | \ |  |  
BB1 | \ |  |  |  |  |  
SD | \ | \ | \ | \ |  |  
Bold numbers indicate the most efficient algorithm under each residual threshold, and a backslash represents an iteration count exceeding the prescribed maximum. From Table 1, we see that the CY method performs better than the other methods except CG. Although we do not yet beat CG at high precision, our method is still competitive in most cases. Specifically, CY is stable throughout the iterations and much better than CG when low accuracy is required.
5 Conclusions
In this paper, we have proposed a new gradient method and shown that it is very competitive with the CG method and better than the other gradient methods for large-scale problems. However, theoretical evidence supporting these results is still lacking. It is also important to find a better parallelization strategy. Therefore, the properties and the high-performance implementation of gradient methods remain to be studied.
Acknowledgment
This work was supported by the French national programme LEFE/INSU and the project ADOM (Méthodes de décomposition de domaine asynchrones) of the French National Research Agency (ANR).
References
- [1] H. Akaike. On a successive transformation of probability distribution and its application to the analysis of the optimum gradient method. Annals of the Institute of Statistical Mathematics, 11(1):1–16, 1959.
- [2] J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141–148, 1988.
- [3] A. L. Cauchy. Méthode générale pour la résolution des systèmes d’équations simultanées. Comp. Rend. Sci. Paris, 25(1):536–538, 1847.
- [4] Y.-H. Dai. Alternate step gradient method. Optimization, 52(4-5):395–415, 2003.
- [5] Y.-H. Dai and Y.-X. Yuan. Alternate minimization gradient method. IMA Journal of Numerical Analysis, 23(3):377–393, 2003.
- [6] Y.-H. Dai and Y.-X. Yuan. Analysis of monotone gradient methods. Journal of Industrial and Management Optimization, 1(2):181–192, 2005.
- [7] T. A. Davis and Y. Hu. The University of Florida sparse matrix collection. ACM Transactions on Mathematical Software, 38(1):1:1–1:25, 2011.
- [8] R. De Asmundis, D. di Serafino, F. Riccio, and G. Toraldo. On spectral properties of steepest descent methods. IMA Journal of Numerical Analysis, 33(4):1416–1435, 2013.
- [9] R. Fletcher. On the Barzilai-Borwein method. In L. Qi, K. Teo, and X. Yang, editors, Optimization and Control with Applications, pages 235–256. Springer US, Boston, MA, 2005.
- [10] A. Friedlander, J. M. Martínez, B. Molina, and M. Raydan. Gradient method with retards and generalizations. SIAM Journal on Numerical Analysis, 36(1):275–289, 1999.
- [11] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
- [12] F. Magoulès and A.-K. Cheik Ahamed. Alinea: An advanced linear algebra library for massively parallel computations on graphics processing units. The International Journal of High Performance Computing Applications, 29(3):284–310, 2015.
- [13] F. Magoulès and G. Gbikpi-Benissan. JACK: an asynchronous communication kernel library for iterative algorithms. The Journal of Supercomputing, 73(8):3468–3487, 2017.
- [14] F. Magoulès and G. Gbikpi-Benissan. Distributed convergence detection based on global residual error under asynchronous iterations. IEEE Transactions on Parallel and Distributed Systems, 29(4):819–829, 2018.
- [15] F. Magoulès and G. Gbikpi-Benissan. JACK2: An MPI-based communication library with non-blocking synchronization for asynchronous iterations. Advances in Engineering Software, 119:116–133, 2018.
- [16] F. Magoulès, G. Gbikpi-Benissan, and Q. Zou. Asynchronous iterations of Parareal algorithm for option pricing models. Mathematics, 6(4):1–18, 2018.
- [17] F. Magoulès, D. B. Szyld, and C. Venet. Asynchronous optimized Schwarz methods with and without overlap. Numerische Mathematik, 137(1):199–227, 2017.
- [18] F. Magoulès and C. Venet. Asynchronous iterative sub-structuring methods. Mathematics and Computers in Simulation, 145:34–49, 2018.
- [19] M. Raydan. On the Barzilai and Borwein choice of steplength for the gradient method. IMA Journal of Numerical Analysis, 13(3):321–326, 1993.
- [20] Y.-X. Yuan. A new stepsize for the steepest descent method. Journal of Computational Mathematics, 24(2):149–156, 2006.
- [21] B. Zhou, L. Gao, and Y.-H. Dai. Gradient methods with adaptive step-sizes. Computational Optimization and Applications, 35(1):69–86, 2006.