 # On Extensions of Limited Memory Steepest Descent Method

We present some extensions to the limited memory steepest descent method based on spectral properties and cyclic iterations. Our aim is to show that it is possible to combine sweep and delayed strategies to improve the performance of gradient methods. Numerical results are reported which indicate that our new methods outperform the original version. Some remarks on stability and parallel implementation are given at the end.


## 1 Introduction

Unconstrained nonlinear optimization is an exciting field and has been the subject of research for centuries. One of the most widely used methods is the gradient-based iterative method

 x_{n+1} = x_n - \alpha_n g_n, (1)

where $\alpha_n$ is viewed as a steplength and $g_n$ denotes the gradient of the objective function at $x_n$. Consider a strongly convex quadratic function of the form

 f(x) = \tfrac{1}{2} x^\top A x - b^\top x, (2)

where $A$ is an $N \times N$ symmetric positive definite (SPD) matrix and $b$ is an $N$-dimensional vector. The unique solution of

 \min_{x \in \mathbb{R}^N} f(x)

can be obtained by proper gradient methods. In the past three decades, gradient methods have regained interest with the work of Barzilai and Borwein, in which a novel approach, the so-called Barzilai-Borwein (BB) method, was proposed in the form

 \alpha^{BB}_{n} = \frac{s_{n-1}^\top s_{n-1}}{s_{n-1}^\top y_{n-1}},

where $s_{n-1} = x_n - x_{n-1}$ and $y_{n-1} = g_n - g_{n-1}$. This method is nonmonotone with respect to both sequences $\{f(x_n)\}$ and $\{\|g_n\|\}$, where $\|\cdot\|$ denotes the 2-norm. Many generalizations of BB can thus be considered; see the references therein.

In 2012, Fletcher proposed a sweep method called limited memory steepest descent (LMSD). During a sweep, spectral information is exploited in order to approximate the eigenvalues of the Hessian matrix $A$. It follows from (1) and (2) that

 g_{n+1} = g_n - \alpha_n A g_n. (3)

Let $\{\lambda_1, \lambda_2, \ldots, \lambda_N\}$ denote the set of eigenvalues of $A$, ordered so that $\lambda_1 < \lambda_2 < \cdots < \lambda_N$, and let $\{v_1, v_2, \ldots, v_N\}$ denote the set of corresponding eigenvectors. We notice that $g_n$ can be decomposed as

 g_n = \sum_{i=1}^{N} \zeta_{i,n} v_i,

where $\zeta_{i,n}$ is the $i$-th spectral component of $g_n$. It follows that

 \zeta_{i,n+1} = (1 - \alpha_n \lambda_i)\,\zeta_{i,n} = \prod_{j=0}^{n} (1 - \alpha_j \lambda_i)\,\zeta_{i,0}.
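This recursion is easy to check with a small sketch (pure Python, with an assumed diagonal Hessian, so each spectral component is simply a gradient component; the eigenvalues and initial gradient below are illustrative): whenever some steplength equals $1/\lambda_i$, the $i$-th component is annihilated for all subsequent iterations.

```python
# Sketch: spectral components under the gradient iteration (3), for an
# assumed diagonal Hessian A = diag(lams); zeta_{i,n} is the i-th entry of g_n.
lams = [1.0, 3.0, 10.0]          # eigenvalues lambda_i of A (illustrative)
g = [2.0, -1.0, 0.5]             # initial components zeta_{i,0}

# Use each inverse eigenvalue once as a steplength: alpha = 1/lambda_i kills
# the i-th component, since zeta_{i,n+1} = (1 - alpha*lambda_i) * zeta_{i,n}.
for lam in lams:
    alpha = 1.0 / lam
    g = [(1.0 - alpha * li) * gi for li, gi in zip(lams, g)]

print(g)  # all components are annihilated, up to rounding
```

After $N$ such steps the gradient vanishes, which is the finite-termination property that motivates estimating the eigenvalues.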

If there exists any steplength such that $\alpha_j = 1/\lambda_i$, then the $i$-th spectral component vanishes for all subsequent iterations. In particular, if $A$ is a diagonal matrix, then $\zeta_{i,n}$ is simply the $i$-th component of $g_n$. Fletcher came up with the idea of using the back gradient vectors to compute Ritz values, regarded as estimates of the eigenvalues of $A$. Consider the $m$ most recent back gradient vectors

 G = [\,g_{n-m},\ g_{n-m+1},\ \ldots,\ g_{n-1}\,].

The QR factorization of $G$ yields an $N \times m$ matrix $Q$ with orthonormal columns and an $m \times m$ upper triangular matrix $R$. By (3), we know that the columns of $G$ lie in the Krylov subspace

 \mathrm{span}\{\,g_{n-m},\ A g_{n-m},\ \ldots,\ A^{m-1} g_{n-m}\,\}.

Hence, the orthogonalization can be regarded as a Lanczos process, which leads to an $m \times m$ tridiagonal matrix

 T = Q^\top A Q. (4)

The eigenvalues of $T$, namely the Ritz values of the Hessian matrix $A$, can be computed easily, for instance by the QL algorithm with implicit shifts.
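As an illustration of (4), the following sketch (pure Python, $m = 2$, with an assumed diagonal Hessian and illustrative data) orthonormalizes two consecutive gradients and reads off the Ritz values as the eigenvalues of the $2 \times 2$ matrix $T = Q^\top A Q$; being Rayleigh-Ritz approximations, they must lie inside the spectrum of $A$.

```python
import math

# Toy sketch of the Ritz computation (4) with m = 2 back gradients,
# for an assumed diagonal Hessian A = diag(lams).
lams = [1.0, 2.0, 5.0, 40.0]
g0 = [1.0, 1.0, 1.0, 1.0]                    # back gradient g_{n-2}
alpha = 0.01                                 # an arbitrary steplength
g1 = [(1 - alpha * l) * x for l, x in zip(lams, g0)]   # g_{n-1}, by (3)

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def Av(v): return [l * x for l, x in zip(lams, v)]     # A is diagonal here

# QR of G = [g0, g1] by Gram-Schmidt (a Lanczos process in exact arithmetic)
q0 = [x / math.sqrt(dot(g0, g0)) for x in g0]
w = [x - dot(q0, g1) * q for x, q in zip(g1, q0)]
q1 = [x / math.sqrt(dot(w, w)) for x in w]

# T = Q^T A Q is 2x2 symmetric; its eigenvalues are the Ritz values
t00, t01, t11 = dot(q0, Av(q0)), dot(q0, Av(q1)), dot(q1, Av(q1))
mid, disc = (t00 + t11) / 2, math.sqrt(((t00 - t11) / 2) ** 2 + t01 ** 2)
ritz = [mid - disc, mid + disc]
print(ritz)  # both Ritz values lie inside [min(lams), max(lams)]
```

For larger $m$ the same structure holds, with the eigenvalues of the $m \times m$ tridiagonal $T$ computed by a routine such as the implicitly shifted QL algorithm mentioned above.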

For general unconstrained optimization problems, $A$ is not available. The following equations provide another way to formulate the method:

 T = [R,\ r_n]\, J R^{-1}, \qquad J = \begin{pmatrix} \alpha_{n-m}^{-1} & & \\ -\alpha_{n-m}^{-1} & \ddots & \\ & \ddots & \alpha_{n-1}^{-1} \\ & & -\alpha_{n-1}^{-1} \end{pmatrix},

where $r_n = Q^\top g_n$; both $R$ and $r_n$ can be computed through the partially extended Cholesky factorization

 G^\top [G,\ g_n] = R^\top [R,\ r_n]. (5)

It is noteworthy that, in the case of minimizing the quadratic function (2) and with $m = 1$, LMSD is mathematically equivalent to BB.
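This equivalence is easy to verify numerically. The following sketch (assumed diagonal quadratic with illustrative data) checks that the BB steplength coincides with the inverse of the single Ritz value obtained from one back gradient.

```python
# For the quadratic (2), one back gradient (m = 1) yields the single Ritz
# value theta = g^T A g / g^T g, and 1/theta equals the BB steplength
# s^T s / s^T y; a small check with an assumed diagonal Hessian.
lams = [1.0, 3.0, 10.0]
g = [2.0, -1.0, 0.5]             # back gradient g_{n-1}
alpha = 0.05                     # previous steplength (arbitrary)

def dot(u, v): return sum(a * b for a, b in zip(u, v))
Ag = [l * x for l, x in zip(lams, g)]

s = [-alpha * x for x in g]      # s_{n-1} = x_n - x_{n-1} = -alpha * g_{n-1}
y = [-alpha * x for x in Ag]     # y_{n-1} = g_n - g_{n-1} = -alpha * A g_{n-1}

bb = dot(s, s) / dot(s, y)       # BB steplength
theta = dot(g, Ag) / dot(g, g)   # Ritz value for m = 1
print(abs(bb - 1.0 / theta))     # zero up to rounding
```

The factor $\alpha^2$ cancels in the BB quotient, which is why the previous steplength plays no role in the identity.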

## 2 Extensions of LMSD method

Extensions to LMSD can be devised by using techniques from the field of delayed gradient methods. The first approach is motivated by spectral properties. Consider the Yuan steplength in the form

 \alpha^{Y}_{n} = \frac{2}{\alpha^{RA}_{n} + \sqrt{(\alpha^{RA}_{n})^{2} - 4\Gamma_{n}}},

where

 \alpha^{RA}_{n} = \frac{1}{\alpha^{SD}_{n-1}} + \frac{1}{\alpha^{SD}_{n}}, \qquad \Gamma_{n} = \frac{1}{\alpha^{SD}_{n-1}\,\alpha^{SD}_{n}} - \frac{\|g_n\|^{2}}{\big(\alpha^{SD}_{n-1}\,\|g_{n-1}\|\big)^{2}}.

Under suitable assumptions, the following limit holds:

 \lim_{n \to \infty} \alpha^{Y}_{n} = \frac{1}{\lambda_N}.

We refer the reader to the references for more details. On the other hand, the most basic gradient method is steepest descent (SD), whose steplength can be written in the form

 \alpha^{SD}_{n} = \operatorname{arg\,min}_{\alpha} f(x_n - \alpha g_n).

A method that relies on these two steplengths was proposed by De Asmundis et al. and can be summarized as follows:

1. Execute several gradient iterations based on the SD steplength.

2. Compute the Yuan steplength $\alpha^{Y}$ as a constant using data from the current and last iterations.

3. Execute several gradient iterations based on this constant steplength.

For the quadratic case, this method can foster the sequence of inverse steplengths to approximate some of the largest eigenvalues, and thus reduce the search space into smaller and smaller dimensions.
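The role of the constant steplength can be seen on a two-dimensional example: in 2D, the Yuan formula above recovers $1/\lambda_{\max}$ exactly from two consecutive SD steplengths. The sketch below (pure Python, an assumed diagonal $2 \times 2$ Hessian with illustrative eigenvalues) checks this.

```python
import math

# Sketch: on a 2D quadratic (assumed diagonal Hessian), the Yuan steplength
# computed from two consecutive SD steplengths equals 1/lambda_max exactly.
lams = [1.0, 2.0]
g = [1.0, 1.0]                                  # current gradient

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def Av(v): return [l * x for l, x in zip(lams, v)]
def sd(v): return dot(v, v) / dot(v, Av(v))     # exact line-search steplength

a_prev = sd(g)                                  # alpha^SD_{n-1}
g_prev, g = g, [(1 - a_prev * l) * x for l, x in zip(lams, g)]  # one SD step
a_cur = sd(g)                                   # alpha^SD_n

a_ra = 1 / a_prev + 1 / a_cur
gamma = 1 / (a_prev * a_cur) - dot(g, g) / (a_prev ** 2 * dot(g_prev, g_prev))
a_y = 2 / (a_ra + math.sqrt(a_ra ** 2 - 4 * gamma))
print(a_y)  # approximately 0.5 = 1/lambda_max
```

In higher dimensions the identity holds only in the limit, since the SD gradients asymptotically oscillate in the span of the two extreme eigenvectors.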

Our main idea is to consider the SD iterations as being used both for fostering alignment and for generating the $m$-dimensional subspaces. Here we report the main steps:

1. Execute several gradient iterations based on the SD steplength.

2. Compute the Yuan steplength $\alpha^{Y}$ as a constant using data from the current and last iterations.

3. Execute several gradient iterations based on this constant steplength.

4. Compute the Ritz values of the Hessian matrix.

5. Execute gradient iterations based on these Ritz values.

We observe that the algorithm outlined above is a combination of the alignment method and LMSD. It might take advantage of both strategies, and improvements can be expected. There exist other constant steplengths that could be exploited for the purpose of alignment.
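The steps above can be sketched as a small driver (pure Python, an assumed diagonal quadratic with illustrative eigenvalues, and $m = 1$ for brevity, so the sweep reduces to a single Ritz step); this is an illustration under those assumptions, not the full implementation.

```python
import math

# Illustrative driver for the combined method on an assumed diagonal
# quadratic: a few SD steps, two steps with the constant Yuan steplength,
# then a gradient step based on a Ritz value (m = 1 for brevity).
lams = [1.0, 2.0, 4.0]
g = [1.0, 1.0, 1.0]

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def Av(v): return [l * x for l, x in zip(lams, v)]
def step(v, a): return [(1 - a * l) * x for l, x in zip(lams, v)]
def sd(v): return dot(v, v) / dot(v, Av(v))     # exact line search

res0 = math.sqrt(dot(g, g))

# 1. two SD iterations (keeping the second steplength for the Yuan formula)
g = step(g, sd(g))
a_prev = sd(g)
g_old, g = g, step(g, a_prev)

# 2. constant Yuan steplength from the current and last iterations
a_cur = sd(g)
a_ra = 1 / a_prev + 1 / a_cur
gamma = 1 / (a_prev * a_cur) - dot(g, g) / (a_prev ** 2 * dot(g_old, g_old))
a_y = 2 / (a_ra + math.sqrt(a_ra ** 2 - 4 * gamma))

# 3. two iterations with the constant steplength
g = step(step(g, a_y), a_y)

# 4.-5. one Ritz value (m = 1) and one gradient step based on it
theta = dot(g, Av(g)) / dot(g, g)
g = step(g, 1 / theta)

print(math.sqrt(dot(g, g)) < res0)  # the residual norm has decreased
```

On this toy problem the constant $\alpha^{Y}$ lands close to $1/\lambda_{\max}$, so the constant-steplength phase damps the dominant spectral component before the sweep starts.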

Another line of extensions starts with the cyclic steplength

 \alpha_n = \alpha_{n-1}.

A convergence theorem applies when the steplength follows the framework of the gradient method with retards, as established by Friedlander et al. It was recently shown in [16, 20] that the cyclic formulation may accelerate the iteration process in terms of both sequential and parallel computation. The main steps based on this strategy can be stated as follows:

1. Compute Ritz values of the Hessian matrix.

2. Execute gradient iterations based on these Ritz values.

Each Ritz value would be used several times for updating the iterates. Only the $m$ most recent back gradient vectors will be stored for computing the next set of Ritz values.
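A minimal sketch of the cyclic reuse (pure Python, an assumed diagonal quadratic with illustrative eigenvalues, $m = 1$ so each sweep produces a single BB-like Ritz value, and an illustrative reuse count of two):

```python
import math

# Sketch of LMSDR-style cyclic reuse on an assumed diagonal quadratic:
# each (m = 1) Ritz value is reused c times before being recomputed.
lams = [1.0, 4.0, 25.0]
g = [1.0, 1.0, 1.0]
c = 2                                        # reuse count (illustrative)

def dot(u, v): return sum(a * b for a, b in zip(u, v))
def Av(v): return [l * x for l, x in zip(lams, v)]

res0 = math.sqrt(dot(g, g))
for sweep in range(4):
    theta = dot(g, Av(g)) / dot(g, g)        # single Ritz value (m = 1)
    for _ in range(c):                       # cyclic reuse of 1/theta
        g = [(1 - l / theta) * x for l, x in zip(lams, g)]

print(math.sqrt(dot(g, g)) < res0)  # nonmonotone, but reduced after 4 sweeps
```

The intermediate residuals can grow, which is the nonmonotonicity discussed below; the reuse saves one Ritz computation (and, in parallel settings, its communication) per extra iteration.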

The previous methods are called LMSD, LMSD with constant steplength (LMSDC), and LMSD with retards (LMSDR), respectively. For LMSD, Fletcher has further studied two questions: how should the first back gradient vectors be initialized, and in what order should the Ritz values be used for performing the gradient iterations within a sweep? He suggested initializing with one or more Ritz values and proceeding in a greedy way, namely, in what follows one could use all available back gradient vectors for constructing $T$ until it is possible to use $m$ back vectors. For LMSDC, it is clear that SD steplengths must be employed in the first iterations. For LMSDR, the initialization remains the same as that of LMSD and thus could be carried out by the greedy choices mentioned above.

In the original LMSD method, Ritz values are selected in decreasing order, which gives the best chance that monotonicity is retained in the early steps. In addition, if nonmonotonicity is detected in terms of $f(x_n)$ or $\|g_n\|$, then the current sweep is terminated. Obviously, these approaches can be adapted for LMSDC without additional operations. For LMSDR, since each Ritz value is employed in several iterations, the most reliable approach is to execute all these iterations in decreasing order with the monotonicity detector mentioned above.

## 3 Numerical Experiments

We implement these algorithms with a linear algebra programming library. QR factorization combined with (4) is used for computing the Ritz values, for two reasons: first, our programming environment has enough working memory to store several long vectors; second, the implementation based on the Cholesky factorization (5) suffers from numerical instabilities, since $G$ can be numerically ill-conditioned. A positive definite quadratic function is used as the test case, based on an SPD matrix of size $20$ with $\lambda_1 = 1$; the other eigenvalues are distributed in geometric progression with ratio $\sqrt{2}$. More specifically, the Hessian matrix can be written as follows:

 A = \mathrm{diag}(1,\ 1.414,\ \ldots,\ 724.077).

The right-hand side $b$ is selected as a random vector. The stopping criterion is a prescribed reduction of the gradient norm, and the initial guess $x_0$ is fixed at zero.

In Figures 1 and 2 we show the history of the residual norm and of the quadratic function, respectively.

We choose , , and . The figures show that the convergence behaviors of LMSD and its two variants are about the same. Although the residual norm is nonmonotone throughout the iterations, the quadratic function decreases monotonically. Table 1 provides computation times, from which we find that LMSDC and LMSDR are significantly better than the original version.

This observation is consistent with our expectations and shows the better performance of cyclic iterations, in terms of computation cost, compared with classical iterations.

We also report the results of nonmonotone termination within a sweep in Table 1. We can see that LMSDC almost preserves monotonicity even without the local termination criterion. On the other hand, the termination strategy mentioned above is needed to provide a certain monotonicity for LMSD and LMSDR.

It is known that LMSD is mathematically equivalent to BB for $m = 1$ if nonmonotone sweeps are considered. However, such behavior may not occur in experiments, which are subject to rounding errors. An example is shown in Figure 3.

Note that we use here the same Hessian matrix and right-hand side as in the preceding test. In the figure we observe that the convergence results of BB and LMSD are about the same in the beginning. After a number of iterations we can see some significant difference between the two curves, which indicates the effect of rounding errors. In Table 2 we highlight some points in the convergence history and notice that the absolute difference of the residuals eventually even exceeds the smaller residual value. This observation is consistent with the two curves in Figure 3, which shows that BB and nonmonotone LMSD with $m = 1$ are mathematically, but not numerically, equivalent.

## 4 Practical Considerations

Experience shows that computing Ritz values according to (4) can be a better choice in terms of stability, at the expense of more arithmetic operations and extra long vectors. Fletcher gave some remedies, which consist of discarding the oldest back gradient vectors and recomputing the corresponding matrices. These approaches were also reported as parts of the implementation. Note that the cyclic formulation may amplify rounding errors; as a result, we could not choose a large $m$ in practice. On the other hand, a smooth reduction of the residual can be expected during the alignment process, and in fact the local termination detector can be removed without changing the monotonicity.

The matrix $G$, whose number of rows is much larger than its number of columns, can be factorized by parallel algorithms. For example, Golub and Van Loan provided some classical strategies for parallel factorizations. One could employ asynchronous iterations (see, e.g., [14, 12]) in iterative methods for improving the parallel performance. However, it is difficult to handle the effect of the dot product operations. Concerning communication-avoiding techniques, Demmel et al. discussed parallel QR factorization for tall and skinny matrices, which was compared with the classical Householder QR. The storage of long vectors can be avoided by using the Cholesky factorization (5), which has also been discussed in the previous references. On the other hand, the communication cost is reduced by the cyclic formulation in LMSDR, which may lead to better parallel performance.

We have already implemented the parallel variants with JACK [9, 11], an MPI-based communication library with several choices of convergence detectors for both classical and asynchronous iterative computing. Experiments show that LMSDR often gives better convergence results. Another possibility is to implement parallel LMSD variants by overlapping communication phases with computations. Specifically, the bottleneck in these methods is the dot product operations, which can be tackled by synchronization-reducing techniques. We refer the reader to the references for a more detailed discussion of this topic.

## 5 Conclusions

We have presented two variants of the LMSD method, called LMSDC and LMSDR, respectively. The main ingredients exploited in formulating the new methods are the alignment and cyclic strategies, which have been widely used in delayed gradient methods. The comparison on the chosen test problem reveals that our new methods are superior in terms of computation time while generating convergence curves similar to those of the original version.

## Acknowledgment

This work was funded by the project ADOM (Méthodes de décomposition de domaine asynchrones) of the French National Research Agency (ANR).

## References

•  J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA J. Numer. Anal., 8(1):141–148, 1988.
•  R. De Asmundis, D. di Serafino, W. W. Hager, G. Toraldo, and H. Zhang. An efficient gradient method using the Yuan steplength. Comput. Optim. Appl., 59(3):541–563, 2014.
•  J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou. Communication-optimal parallel and sequential QR and LU factorizations. SIAM J. Sci. Comput., 34(1):A206–A239, 2012.
•  D. di Serafino, V. Ruggiero, G. Toraldo, and L. Zanni. On the steplength selection in gradient methods for unconstrained optimization. Appl. Math. Comput., 318:176–195, 2018.
•  R. Fletcher. A limited memory steepest descent method. Math. Program., 135(1):413–436, 2012.
•  A. Friedlander, J. M. Martínez, B. Molina, and M. Raydan. Gradient method with retards and generalizations. SIAM J. Numer. Anal., 36(1):275–289, 1999.
•  G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.
•  F. Magoulès and A.-K. Cheik Ahamed. Alinea: An advanced linear algebra library for massively parallel computations on graphics processing units. Int. J. High Perform. Comput. Appl., 29(3):284–310, 2015.
•  F. Magoulès and G. Gbikpi-Benissan. JACK: An asynchronous communication kernel library for iterative algorithms. J. Supercomput., 73(8):3468–3487, 2017.
•  F. Magoulès and G. Gbikpi-Benissan. Distributed convergence detection based on global residual error under asynchronous iterations. IEEE Trans. Parallel Distrib. Syst., 29(4):819–829, 2018.
•  F. Magoulès and G. Gbikpi-Benissan. JACK2: An MPI-based communication library with non-blocking synchronization for asynchronous iterations. Adv. Eng. Softw., 119:116–133, 2018.
•  F. Magoulès, G. Gbikpi-Benissan, and Q. Zou. Asynchronous iterations of Parareal algorithm for option pricing models. Mathematics, 6(4):1–18, 2018.
•  F. Magoulès, F.-X. Roux, and G. Houzeaux. Parallel Scientific Computing. Wiley-ISTE, 2015.
•  F. Magoulès and C. Venet. Asynchronous iterative sub-structuring methods. Math. Comput. Simul., 145:34–49, 2018.
•  Y.-X. Yuan. A new stepsize for the steepest descent method. J. Comput. Math., 24(2):149–156, 2006.
•  Q. Zou and F. Magoulès. A new cyclic gradient method adapted to large-scale linear systems. In Proceedings of the 17th International Symposium on Distributed Computing and Applications to Business, Engineering and Science, pages 196–199, Wuxi, China, 2018. IEEE.
•  Q. Zou and F. Magoulès. Fast gradient methods with alignment for symmetric linear systems without using Cauchy step. preprint available at arXiv:1909.01479, 2019.
•  Q. Zou and F. Magoulès. Parameter estimation in the Hermitian and skew-Hermitian splitting method using gradient iterations. preprint available at arXiv:1909.01481, 2019.
•  Q. Zou and F. Magoulès. Recent developments in iterative methods for reducing synchronization. preprint available at arXiv:1912.00816, 2019.
•  Q. Zou and F. Magoulès. Reducing the effect of global synchronization in delayed gradient methods for symmetric linear systems. submitted.