# Fast Gradient Methods with Alignment for Symmetric Linear Systems without Using Cauchy Step

The performance of gradient methods has been considerably improved by the introduction of delayed parameters. After two and a half decades, the revealing of second-order information has recently given rise to the Cauchy-based methods with alignment, which asymptotically reduce the search space to smaller and smaller dimensions; they are generally considered the state of the art of gradient methods. This paper reveals the spectral properties of the minimal gradient and asymptotically optimal steps, and then suggests three fast methods with alignment that do not use the Cauchy step. Convergence results are provided, and numerical experiments show that the new methods are competitive and more stable alternatives to the classical Cauchy-based methods. In particular, alignment gradient methods present advantages over Krylov subspace methods in some situations, which makes them attractive in practice.


## 1 Introduction

Consider the linear system

$$Ax=b, \qquad (1)$$

where $A\in\mathbb{R}^{N\times N}$ is symmetric positive definite (SPD) and $b\in\mathbb{R}^{N}$. The solution $x^*$ is the unique global minimizer of the strictly convex quadratic function

$$f(x)=\frac{1}{2}x^\top Ax-b^\top x. \qquad (2)$$

The gradient method is of the form

$$x_{n+1}=x_n-\alpha_n g_n,\qquad n=0,1,\ldots, \qquad (3)$$

where $g_n=\nabla f(x_n)=Ax_n-b$. The steepest descent (SD) method, originally proposed in [4], defines the steplength as the reciprocal of a Rayleigh quotient of the Hessian matrix,

$$\alpha^{SD}_n=\frac{g_n^\top g_n}{g_n^\top Ag_n}, \qquad (4)$$

which is also called the Cauchy steplength. It minimizes the function value, or equivalently the $A$-norm of the error, and thus gives a theoretically optimal result in each step:

$$\alpha^{SD}_n=\operatorname*{arg\,min}_{\alpha}f(x_n-\alpha g_n)=\operatorname*{arg\,min}_{\alpha}\|(I-\alpha A)e_n\|_A,$$

where $e_n=x_n-x^*$. This classical method is known to behave badly in practice: the directions generated tend to alternate asymptotically between two orthogonal directions, leading to slow convergence [1].
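For illustration, the SD iteration can be written in a few lines of NumPy; the system, starting point, and iteration count below are arbitrary choices, and the recorded steplengths make the asymptotic two-step cycle visible.

```python
import numpy as np

def steepest_descent(A, b, x0, iters):
    """Gradient method (3) with the Cauchy steplength (4)."""
    x = x0.astype(float)
    steps = []
    for _ in range(iters):
        g = A @ x - b                    # gradient of f at x_n
        alpha = (g @ g) / (g @ A @ g)    # Cauchy steplength
        steps.append(alpha)
        x = x - alpha * g
    return x, steps

# Illustrative ill-conditioned SPD system (condition number 100).
A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
x, steps = steepest_descent(A, b, np.zeros(3), iters=1000)
```

Printing the tail of `steps` shows it alternating between two fixed values, the numerical signature of the zigzag behavior described above.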

The first gradient method with retards is the Barzilai-Borwein (BB) method, originally proposed in [3]. The BB method is of the form

$$\alpha^{BB}_n=\frac{g_{n-1}^\top g_{n-1}}{g_{n-1}^\top Ag_{n-1}},$$

which remedies the convergence issue for ill-conditioned problems by using a nonmonotone steplength. The motivation arose from providing a two-point approximation to the quasi-Newton methods, namely

$$\alpha^{BB}_n=\operatorname*{arg\,min}_{\alpha}\left\|\frac{1}{\alpha}\Delta x-\Delta g\right\|^2,$$

where $\Delta x=x_n-x_{n-1}$ and $\Delta g=g_n-g_{n-1}$. Notice that $\alpha^{BB}_n=\alpha^{SD}_{n-1}$. There exists a similar method developed by symmetry in [3],

$$\alpha^{BB2}_n=\frac{g_{n-1}^\top Ag_{n-1}}{g_{n-1}^\top A^2g_{n-1}},$$

which likewise imposes a quasi-Newton property

$$\alpha^{BB2}_n=\operatorname*{arg\,min}_{\alpha}\|\Delta x-\alpha\Delta g\|^2.$$

We remark that $\alpha^{BB2}_n=\alpha^{MG}_{n-1}$; see Section 2. Practical experience is generally in favor of BB. The convergence analysis of these methods was given in [29] and [7], and the preconditioned version was established in [25]. A more recent chapter [15] discussed the efficiency of BB. In the years that followed, numerous generalizations have appeared, such as alternate methods [5, 9], cyclic methods [18, 5, 6], adaptive methods [36, 17], and some general frameworks [18, 5, 35].
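A minimal sketch of the BB iteration on the quadratic (2); the matrix, starting point, and tolerance below are illustrative choices. Note how the update reuses the previous pair $(\Delta x,\Delta g)$, so on a quadratic each BB1 step coincides with the Cauchy step delayed by one iteration.

```python
import numpy as np

def barzilai_borwein(A, b, x0, iters):
    """BB1 method for the quadratic (2): alpha_n = s^T s / s^T y."""
    x = x0.astype(float)
    g = A @ x - b
    alpha = (g @ g) / (g @ A @ g)          # plain Cauchy step to start
    for _ in range(iters):
        x_new = x - alpha * g
        g_new = A @ x_new - b
        if np.linalg.norm(g_new) < 1e-13:  # solved to machine precision
            return x_new
        s, y = x_new - x, g_new - g        # Delta x and Delta g
        alpha = (s @ s) / (s @ y)          # BB1; BB2 would be (s @ y)/(y @ y)
        x, g = x_new, g_new
    return x

A = np.diag([1.0, 10.0, 100.0])
b = np.array([1.0, 1.0, 1.0])
x = barzilai_borwein(A, b, np.zeros(3), iters=200)
```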

There exist several auxiliary steplengths acting as accelerators of other methods. More precisely, occasionally performing such auxiliary iterative steps can often improve the global convergence. For example, in order to find the unique minimizer of a two-dimensional problem in finitely many iterations, [34] proposed an ingenious steplength as follows:

$$\alpha^{Y}_n=2\left(\sqrt{\left(\frac{1}{\alpha^{SD}_{n-1}}-\frac{1}{\alpha^{SD}_n}\right)^2+\frac{4\,g_n^\top g_n}{(\alpha^{SD}_{n-1})^2\,g_{n-1}^\top g_{n-1}}}+\frac{1}{\alpha^{SD}_{n-1}}+\frac{1}{\alpha^{SD}_n}\right)^{-1},$$

which is called the Yuan steplength. Recently, [14] proposed a new gradient method that also exploits the spectral properties of SD. The improvement resorts to a special steplength

$$\alpha^{A}_n=\left(\frac{1}{\alpha^{SD}_{n-1}}+\frac{1}{\alpha^{SD}_n}\right)^{-1}.$$

In one direction, these steplengths give rise to some efficient gradient methods. For example, [10] provided several alternate steps, of which we mention here the second variant

$$\alpha^{DY}_n=\begin{cases}\alpha^{SD}_n, & n\bmod 4=0\text{ or }1,\\ \alpha^{Y}_n, & \text{otherwise},\end{cases}$$

which seems to be the most promising variant according to the experiments. As usual, it does not have a specific name; here we call it the Dai-Yuan (DY) method [17]. A closer examination of the Yuan variants revealed that they have a distinguishing property called “decreasing together” [10]: the gradient components along all eigenvectors decrease together, which means that DY does not sink into any lower subspace spanned by eigenvectors. Experiments have shown that BB also has this feature. An important difference comes from the fact that BB is a nonmonotone steplength, whereas DY is monotone and thus more stable.

On the other hand, the auxiliary steps lead to gradient methods with alignment such as

$$\alpha^{SDA}_n=\begin{cases}\alpha^{SD}_n, & n\bmod(d_1+d_2)<d_1,\\ \alpha^{A}_t, & \text{otherwise},\end{cases}$$

with $d_1,d_2\ge1$, where $t$ denotes the last iteration at which a Cauchy step was taken. This method is called steepest descent with alignment (SDA). Here, we choose the version described in [13] without using the switch condition illustrated in [14], and vary the form while leaving the alignment property unchanged. Shortly after, a similar method based on the Yuan steplength was presented in [12], called steepest descent with constant steplength (SDC), which is of the form

$$\alpha^{SDC}_n=\begin{cases}\alpha^{SD}_n, & n\bmod(d_1+d_2)<d_1,\\ \alpha^{Y}_t, & \text{otherwise},\end{cases}$$

with $d_1,d_2\ge1$ and $t$ as above. The main feature of this method is to foster the reduction of gradient components along selected eigenvectors of $A$, thereby reducing the search space into smaller and smaller dimensions; the problem tends to have a better and better condition number [12]. We note that the motivations of SDA and SDC are different according to [14] and [12]. Since their derivations both involve the spectral analysis of the Cauchy step, we regard both of them as alignment methods. These two methods seem to be the state of the art of gradient methods and tend to give the best performance among all of these. Recently, [19] introduced a general framework of Cauchy steplengths with alignment, which breaks the Cauchy cycle by periodically applying some short steplengths.

Despite the good practical performance of alignment methods, all promising formulations are based on the Cauchy steplength in order to ensure the alignment feature. It is desirable to relax this restriction and step outside the framework. In this paper, we address this issue and investigate gradient methods with the alignment property that do not use the Cauchy steplength. In Section 2, we analyze the spectral properties of the minimal gradient step. In Section 3, we introduce some new gradient methods by virtue of the basic steplengths and discuss their alignment property. In Section 4, we focus on the convergence analysis of the new methods. A set of numerical experiments is presented in Section 5, and concluding remarks are drawn in Section 6.

## 2 Spectral analysis of minimal gradient

The minimal gradient (MG) method was proposed in [23] and is of the form

$$\alpha^{MG}_n=\frac{g_n^\top Ag_n}{g_n^\top A^2g_n}.$$

It minimizes the Euclidean norm of the gradient,

$$\alpha^{MG}_n=\operatorname*{arg\,min}_{\alpha}\|g_n-\alpha Ag_n\|^2,$$

where $\|\cdot\|$ denotes the Euclidean norm of a vector. Traditionally this method does not have a specific name. From [22] we know that it was originally called “minimal residues”. However, this term might cause confusion, since there exists a Krylov subspace method called MINRES [27] which minimizes the norm of the residual through the Lanczos process. On the other hand, MG is also a special case of the Orthomin(k) method with $k=0$ [20], and is thus sometimes called OM [2, 33]. Here, the name “minimal gradient” comes from [9], since the method gives an optimal gradient result in each step.
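Since $\alpha^{MG}_n$ solves a one-dimensional least-squares problem, it can be validated directly: for any SPD matrix and vector (random, illustrative choices below), no nearby steplength yields a smaller residual $\|g-\alpha Ag\|$.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)        # an SPD test matrix (illustrative)
g = rng.standard_normal(5)           # stand-in for a gradient vector

# MG steplength: least-squares minimizer of ||g - alpha * A g||.
alpha_mg = (g @ A @ g) / (g @ A @ (A @ g))

def residual(a):
    return np.linalg.norm(g - a * (A @ g))
```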

We can assume without loss of generality that

$$0<\lambda_1\le\cdots\le\lambda_N,$$

where $\{\lambda_1,\ldots,\lambda_N\}$ is the set of eigenvalues of $A$ and $\{v_1,\ldots,v_N\}$ is the set of associated orthonormal eigenvectors. Let $\kappa$ be the condition number of $A$, namely

$$\kappa=\frac{\lambda_N}{\lambda_1}. \qquad (5)$$

From (3) we can deduce that

$$g_{n+1}=(I-\alpha_nA)g_n. \qquad (6)$$

There exist real numbers $\zeta_{i,n}$ such that

$$g_n=\sum_{i=1}^{N}\zeta_{i,n}v_i. \qquad (7)$$

Then, substituting (7) into (6) implies

$$\zeta_{i,n+1}=(1-\alpha_n\lambda_i)\zeta_{i,n}.$$

We know from [1] that the SD method asymptotically reduces to a search in the two-dimensional subspace spanned by the eigenvectors corresponding to the largest and smallest eigenvalues of $A$. Eventually the directions generated tend to zigzag between two orthogonal directions, which gives rise to a slow convergence rate. This behavior was demonstrated by using the following lemma; see [1] and [16] for more details.

###### Lemma 1.

Let $\{p_{i,n}\}$ be a probability measure attached to $\{\lambda_1,\ldots,\lambda_N\}$, where $p_{i,n}\ge0$ and $\sum_{i=1}^Np_{i,n}=1$. Consider a transformation such that

$$p_{i,n+1}=\frac{\left(\sum_{j=1}^N\lambda_jp_{j,n}-\lambda_i\right)^2}{\sum_{l=1}^N\left(\sum_{j=1}^N\lambda_jp_{j,n}-\lambda_l\right)^2p_{l,n}}\,p_{i,n}.$$

Then,

$$\lim_{n\to\infty}p_{i,2n}=\begin{cases}p^*, & i=1,\\ 0, & i\in\{2,\ldots,N-1\},\\ 1-p^*, & i=N,\end{cases}$$

and

$$\lim_{n\to\infty}p_{i,2n+1}=\begin{cases}1-p^*, & i=1,\\ 0, & i\in\{2,\ldots,N-1\},\\ p^*, & i=N,\end{cases}$$

for some $p^*\in[0,1]$ depending on the initial measure.

We now give our main result on the spectral properties of MG. These arguments lead to the gradient methods with alignment which shall be described in Section 3.

###### Theorem 2.

Consider the linear system $Ax=b$, where $A\in\mathbb{R}^{N\times N}$ is SPD and $b\in\mathbb{R}^N$. Assume that the sequence of solution vectors $\{x_n\}$ is generated by the MG method. If $\lambda_1<\lambda_N$ and the starting point $x_0$ is such that $\zeta_{1,0}\neq0$ and $\zeta_{N,0}\neq0$, then for some constant $c\neq0$, the following results hold:

1. $$\lim_{n\to\infty}\frac{\lambda_i\zeta^2_{i,2n}}{\sum_{j=1}^N\lambda_j\zeta^2_{j,2n}}=\begin{cases}\dfrac{1}{1+c^2}, & i=1,\\ 0, & i\in\{2,\ldots,N-1\},\\ \dfrac{c^2}{1+c^2}, & i=N,\end{cases} \qquad (8)$$
$$\lim_{n\to\infty}\frac{\lambda_i\zeta^2_{i,2n+1}}{\sum_{j=1}^N\lambda_j\zeta^2_{j,2n+1}}=\begin{cases}\dfrac{c^2}{1+c^2}, & i=1,\\ 0, & i\in\{2,\ldots,N-1\},\\ \dfrac{1}{1+c^2}, & i=N;\end{cases} \qquad (9)$$
2. $$\lim_{n\to\infty}\alpha^{MG}_{2n}=\frac{1+c^2}{\lambda_1(1+c^2\kappa)}, \qquad (10)$$
$$\lim_{n\to\infty}\alpha^{MG}_{2n+1}=\frac{1+c^2}{\lambda_1(c^2+\kappa)}; \qquad (11)$$
3. $$\lim_{n\to\infty}\frac{\|g_{n+1}\|^2}{\|g_n\|^2}=\frac{c^2(\kappa-1)^2}{(c^2+\kappa)(1+c^2\kappa)}; \qquad (12)$$
4. $$\lim_{n\to\infty}\frac{g^\top_{2n+1}Ag_{2n+1}}{g^\top_{2n}Ag_{2n}}=\frac{c^2(\kappa-1)^2}{(1+c^2\kappa)^2}, \qquad (13)$$
$$\lim_{n\to\infty}\frac{g^\top_{2n+2}Ag_{2n+2}}{g^\top_{2n+1}Ag_{2n+1}}=\frac{c^2(\kappa-1)^2}{(c^2+\kappa)^2}. \qquad (14)$$
###### Proof.

We first prove (8) and (9). We have

$$\zeta_{i,n+1}=(1-\alpha^{MG}_n\lambda_i)\zeta_{i,n}.$$

Together with (7), this implies that

$$\zeta_{i,n+1}=\left(1-\frac{\sum_{j=1}^N\lambda_j\zeta^2_{j,n}}{\sum_{j=1}^N\lambda^2_j\zeta^2_{j,n}}\,\lambda_i\right)\zeta_{i,n}.$$

For any $i$ and $n$, let us write $\hat{p}_{i,n}=\lambda_i\zeta^2_{i,n}$; it follows that

$$\hat{p}_{i,n+1}=\left(1-\frac{\sum_{j=1}^N\hat{p}_{j,n}}{\sum_{j=1}^N\lambda_j\hat{p}_{j,n}}\,\lambda_i\right)^2\hat{p}_{i,n}. \qquad (15)$$

Moreover, we define a probability measure

$$p_{i,n}=\frac{\hat{p}_{i,n}}{\sum_{j=1}^N\hat{p}_{j,n}}, \qquad (16)$$

from which we notice that $p_{i,n}\ge0$ and $\sum_{i=1}^Np_{i,n}=1$. Hence,

$$p_{i,n+1}=\left(\frac{\sum_{j=1}^N\lambda_jp_{j,n}-\lambda_i}{\sum_{j=1}^N\lambda_jp_{j,n}}\right)^2\frac{\hat{p}_{i,n}}{\sum_{l=1}^N\hat{p}_{l,n+1}}.$$

Notice that the probability measure $p_{i,n}$ in Lemma 1 can, without loss of generality, be taken to be the one defined in (16). Substituting (15) and applying (16) again, it follows that

$$p_{i,n+1}=\frac{\left(\sum_{j=1}^N\lambda_jp_{j,n}-\lambda_i\right)^2}{\sum_{l=1}^N\left(\sum_{j=1}^N\lambda_jp_{j,n}-\lambda_l\right)^2p_{l,n}}\,p_{i,n}.$$

Along with Lemma 1 the desired result follows.

For argument (b), notice that

$$\alpha^{MG}_n=\frac{1}{\sum_{j=1}^N\lambda_jp_{j,n}}.$$

Since argument (a) has been proved, relations (10) and (11) follow immediately by applying (8) and (9).

Then we prove argument (c). For any $n$, it follows from (6) that

$$\frac{\|g_{n+1}\|^2}{\|g_n\|^2}=\frac{\sum_{j=1}^N(1-\alpha^{MG}_n\lambda_j)^2\zeta^2_{j,n}}{\sum_{j=1}^N\zeta^2_{j,n}}.$$

Combining (8) and (10) implies

$$\lim_{n\to\infty}\frac{\|g_{2n+1}\|^2}{\|g_{2n}\|^2}=\frac{\left(1-\frac{1+c^2}{1+c^2\kappa}\right)^2\frac{\lambda_1^{-1}}{1+c^2}+\left(1-\frac{(1+c^2)\kappa}{1+c^2\kappa}\right)^2\frac{\lambda_N^{-1}c^2}{1+c^2}}{\frac{\lambda_1^{-1}}{1+c^2}+\frac{\lambda_N^{-1}c^2}{1+c^2}}=\frac{(\kappa-1)^2c^4\kappa+(\kappa-1)^2c^2}{(c^2+\kappa)(1+c^2\kappa)^2}.$$

After some simplification, we obtain (12) along the even subsequence. In an analogous fashion, combining (9) and (11) yields

$$\lim_{n\to\infty}\frac{\|g_{2n+2}\|^2}{\|g_{2n+1}\|^2}=\frac{\left(1-\frac{1+c^2}{c^2+\kappa}\right)^2\frac{\lambda_1^{-1}c^2}{1+c^2}+\left(1-\frac{(1+c^2)\kappa}{c^2+\kappa}\right)^2\frac{\lambda_N^{-1}}{1+c^2}}{\frac{\lambda_1^{-1}c^2}{1+c^2}+\frac{\lambda_N^{-1}}{1+c^2}}=\frac{(\kappa-1)^2c^2\kappa+(\kappa-1)^2c^4}{(c^2+\kappa)^2(1+c^2\kappa)}.$$

One finds that the odd subsequence converges to the same limit, which is the desired conclusion.

Finally, for argument (d), we can similarly combine (8) and (10), which implies

$$\lim_{n\to\infty}\frac{g^\top_{2n+1}Ag_{2n+1}}{g^\top_{2n}Ag_{2n}}=\left(1-\frac{1+c^2}{\lambda_1(1+c^2\kappa)}\lambda_1\right)^2\frac{1}{1+c^2}+\left(1-\frac{1+c^2}{\lambda_1(1+c^2\kappa)}\lambda_N\right)^2\frac{c^2}{1+c^2}=\frac{c^4(\kappa-1)^2+c^2(\kappa-1)^2}{(1+c^2\kappa)^2(1+c^2)}.$$

Repeating this process for the other parity by using (9) and (11) yields

$$\lim_{n\to\infty}\frac{g^\top_{2n+2}Ag_{2n+2}}{g^\top_{2n+1}Ag_{2n+1}}=\left(1-\frac{1+c^2}{\lambda_1(c^2+\kappa)}\lambda_1\right)^2\frac{c^2}{1+c^2}+\left(1-\frac{1+c^2}{\lambda_1(c^2+\kappa)}\lambda_N\right)^2\frac{1}{1+c^2}=\frac{c^2(\kappa-1)^2+c^4(\kappa-1)^2}{(c^2+\kappa)^2(1+c^2)}.$$

After some simplification, we obtain (13) and (14). This completes our proof. ∎

###### Remark.

The assumption used in Theorem 2 is not restrictive, since if there exist some repeated eigenvalues, then we can choose the corresponding eigenvectors so that the superfluous components vanish [15]. Moreover, if $\zeta_{1,0}$ or $\zeta_{N,0}$ equals zero, then the second condition can simply be replaced by the same condition on the components with the corresponding inner indices, without changing the results discussed later on.

Note that argument (a) in Theorem 2 has been proved in [28] for a more general framework of gradient algorithms, while results (b) to (d) for the MG method have not appeared in the literature. Result (b) shows that MG also exhibits the zigzag behavior, namely, it asymptotically alternates between two directions. The implications of Theorem 2 shall be seen later in Section 3. For now, we give the asymptotic behavior of the quadratic function for completeness.
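These limits can be observed numerically. In the experiment below (an illustrative diagonal system with $b=0$, so $g_n=Ax_n$), the even and odd MG steplengths settle to two values; consistent with (10) and (11), the reciprocals of the two limits sum to $\lambda_1+\lambda_N$, and the middle spectral weights in (8) vanish.

```python
import numpy as np

A = np.diag([1.0, 3.0, 7.0, 20.0])    # lambda_1 = 1, lambda_N = 20
x = np.ones(4)                         # zeta_{1,0} and zeta_{N,0} nonzero
alphas = []
for _ in range(240):
    g = A @ x                          # b = 0, so g_n = A x_n
    alphas.append((g @ A @ g) / (g @ A @ (A @ g)))   # MG steplength
    x = x - alphas[-1] * g

g = A @ x
weights = np.diag(A) * g**2            # lambda_i * zeta_i^2
weights = weights / weights.sum()
```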

###### Theorem 3.

Under the assumptions of Theorem 2, the following results hold:

$$\lim_{n\to\infty}\frac{f(x_{2n+1})-f(x^*)}{f(x_{2n})-f(x^*)}=\frac{c^2(1+c^2\kappa^2)(\kappa-1)^2}{(c^2+\kappa^2)(1+c^2\kappa)^2}, \qquad (17)$$
$$\lim_{n\to\infty}\frac{f(x_{2n+2})-f(x^*)}{f(x_{2n+1})-f(x^*)}=\frac{c^2(c^2+\kappa^2)(\kappa-1)^2}{(1+c^2\kappa^2)(c^2+\kappa)^2}, \qquad (18)$$

and

$$\lim_{n\to\infty}\frac{f(x_{2n+2})-f(x^*)}{f(x_{2n})-f(x^*)}=\lim_{n\to\infty}\frac{\|g_{n+1}\|^4}{\|g_n\|^4}. \qquad (19)$$
###### Proof.

For any $n$, it follows from (2) that

$$\frac{f(x_{n+1})-f(x^*)}{f(x_n)-f(x^*)}=1+\frac{(g_n^\top Ag_n)^3}{(g_n^\top A^{-1}g_n)(g_n^\top A^2g_n)^2}-\frac{2(g_n^\top Ag_n)(g_n^\top g_n)}{(g_n^\top A^2g_n)(g_n^\top A^{-1}g_n)}.$$

Let us write $p_{i,n}$ as defined in (15) and (16), in which case we obtain

$$\frac{f(x_{n+1})-f(x^*)}{f(x_n)-f(x^*)}=1+\frac{1}{\left(\sum_{j=1}^N\lambda_j^{-2}p_{j,n}\right)\left(\sum_{j=1}^N\lambda_jp_{j,n}\right)^2}-\frac{2\sum_{j=1}^N\lambda_j^{-1}p_{j,n}}{\left(\sum_{j=1}^N\lambda_j^{-2}p_{j,n}\right)\left(\sum_{j=1}^N\lambda_jp_{j,n}\right)}.$$

Since this ratio is invariant under a scaling of $A$, we may take $\lambda_1=1$ and $\lambda_N=\kappa$. If $n$ is an even number, from (8), one finds that

$$\lim_{n\to\infty}\frac{f(x_{n+1})-f(x^*)}{f(x_n)-f(x^*)}=1+\frac{1}{\frac{\kappa^2+c^2}{1+c^2}\left(\frac{1+\kappa c^2}{\kappa(1+c^2)}\right)^2}-\frac{2\,\frac{\kappa+c^2}{1+c^2}}{\frac{\kappa^2+c^2}{1+c^2}\cdot\frac{1+\kappa c^2}{\kappa(1+c^2)}}=\frac{\kappa^4c^4-2\kappa^3c^4+\kappa^2c^4+\kappa^2c^2-2\kappa c^2+c^2}{(c^2+\kappa^2)(1+c^2\kappa)^2}.$$

Notice that

$$\kappa^4c^4-2\kappa^3c^4+\kappa^2c^4+\kappa^2c^2-2\kappa c^2+c^2=c^2(1+c^2\kappa^2)(\kappa-1)^2,$$

which yields the first equation. Similarly, if $n$ is an odd number, it follows that

$$\lim_{n\to\infty}\frac{f(x_{n+1})-f(x^*)}{f(x_n)-f(x^*)}=1+\frac{1}{\frac{\kappa^2c^2+1}{1+c^2}\left(\frac{c^2+\kappa}{\kappa(1+c^2)}\right)^2}-\frac{2\,\frac{\kappa c^2+1}{1+c^2}}{\frac{\kappa^2c^2+1}{1+c^2}\cdot\frac{c^2+\kappa}{\kappa(1+c^2)}}=\frac{\kappa^2c^4-2\kappa c^4+c^4+\kappa^4c^2-2\kappa^3c^2+\kappa^2c^2}{(c^2\kappa^2+1)(c^2+\kappa)^2}.$$

The numerator can be factored as follows:

$$\kappa^2c^4-2\kappa c^4+c^4+\kappa^4c^2-2\kappa^3c^2+\kappa^2c^2=c^2(c^2+\kappa^2)(\kappa-1)^2,$$

which yields the second result. Finally, (19) follows immediately by combining (17), (18) and (12). This completes our proof. ∎

## 3 New alignment methods without Cauchy steplength

As far as we know, all existing gradient methods with alignment are based on the Cauchy steplength. After a further rearrangement of steps, [19] concluded that one can break the Cauchy cycle by periodically applying some short steplengths to accelerate the convergence of gradient methods. We show here that this condition is not necessary, and that several methods potentially possessing the same feature can be derived without the Cauchy step.

[14] observed that a suitable constant steplength could lead to the alignment property. Here we extend this observation to a more general case.

###### Theorem 4.

Consider the linear system (1) and the gradient method (3) with a positive constant steplength $\hat\alpha$ such that

$$\hat\alpha\le\frac{2}{\lambda_1+\lambda_N} \qquad (20)$$

being used to solve (1). Then the sequence $\{x_n\}$ converges to $x^*$ for any starting point $x_0$. Moreover, if equality holds, then

$$\lim_{n\to\infty}\frac{\zeta_{i,n}}{\zeta_{1,n}}=\begin{cases}0, & i=2,3,\ldots,N-1,\\ \dfrac{\zeta_{N,0}}{\zeta_{1,0}}(-1)^n, & i=N;\end{cases} \qquad (21)$$

otherwise,

$$\lim_{n\to\infty}\frac{\zeta_{i,n}}{\zeta_{1,n}}=0,\qquad i=2,3,\ldots,N. \qquad (22)$$
###### Proof.

We have

$$\hat\alpha\le\frac{2}{\lambda_1+\lambda_N}<\frac{2}{\lambda_N}\le2\alpha^{SD}_n.$$

By [30], it is easy to deduce that the sequence $\{x_n\}$ converges to $x^*$ for any steplength satisfying $0<\hat\alpha<2\alpha^{SD}_n$. Hence, the first statement holds. One finds that

$$\lim_{n\to\infty}\frac{\zeta_{i,n}}{\zeta_{1,n}}=\frac{\zeta_{i,0}}{\zeta_{1,0}}\lim_{n\to\infty}\left(\frac{1-\hat\alpha\lambda_i}{1-\hat\alpha\lambda_1}\right)^n.$$

Let

$$\varphi_i=\frac{1-\hat\alpha\lambda_i}{1-\hat\alpha\lambda_1}.$$

For (22) to be satisfied, one needs to impose the condition $|\varphi_i|<1$ for all $i\in\{2,\ldots,N\}$, which yields

$$(\lambda_i+\lambda_1)\hat\alpha<2,\qquad(\lambda_i-\lambda_1)\hat\alpha>0.$$

The second condition is obviously satisfied, while the first one leads to

$$\hat\alpha<\frac{2}{\lambda_1+\lambda_N}.$$

If equality holds in (20), then

$$\varphi_i=\frac{\lambda_N+\lambda_1-2\lambda_i}{\lambda_N-\lambda_1}.$$

It is clear that $|\varphi_i|<1$ for $i\in\{2,\ldots,N-1\}$ and $\varphi_N=-1$. Then the second statement trivially follows, which completes the proof. ∎

Note that $\lambda_i=\lambda_1$ leads to the trivial case $\varphi_i=1$, and thus the corresponding limit in both (21) and (22) equals $\zeta_{i,0}/\zeta_{1,0}$. From Theorem 4 we find that condition (20) has a twofold effect: driving the alignment property when the strict inequality holds, as shown in (22), and forcing the search into a two-dimensional space in the equality case, as shown in (21). It means that if there exist some steps asymptotically making the equality of (21) attainable, then the method has a tendency similar to the SD method, namely, alternating between two orthogonal directions. On the other hand, we can add a fractional factor to periodically break the cycle. This asymptotically yields a constant steplength strictly smaller than $2/(\lambda_1+\lambda_N)$, leading to an alignment process in the subsequent iterations according to (22).
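The strict-inequality case of Theorem 4 is easy to demonstrate: running the constant-step iteration with a steplength slightly below $2/(\lambda_1+\lambda_N)$ drives every gradient component ratio $\zeta_{i,n}/\zeta_{1,n}$ to zero, as in (22). The matrix and the factor $0.9$ below are illustrative choices.

```python
import numpy as np

lam = np.array([1.0, 4.0, 9.0, 25.0])
A = np.diag(lam)
x = np.ones(4)                              # all eigencomponents nonzero
alpha = 0.9 * 2.0 / (lam[0] + lam[-1])      # strictly below 2/(lam_1 + lam_N)
for _ in range(300):
    x = x - alpha * (A @ x)                 # constant-step iteration, b = 0

g = A @ x
ratios = g[1:] / g[0]                       # zeta_{i,n} / zeta_{1,n}, i >= 2
```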

Recall that [8] proposed a gradient method of the form

$$\alpha^{AO}_n=\frac{\|g_n\|}{\|Ag_n\|}.$$

It asymptotically converges to the optimal steplength,

$$\lim_{n\to\infty}\alpha^{AO}_n=\alpha^{OPT}=\frac{2}{\lambda_1+\lambda_N},$$

which minimizes the norm of the iteration matrix:

$$\alpha^{OPT}=\operatorname*{arg\,min}_{\alpha}\|I-\alpha A\|.$$

Thus we call it the asymptotically optimal (AO) method. Notice that the following relationship holds:

$$\alpha^{MG}_n\le\alpha^{AO}_n\le\alpha^{SD}_n, \qquad (23)$$

which can be easily proved by the Cauchy-Schwarz inequality $(g_n^\top Ag_n)^2\le(g_n^\top g_n)(g_n^\top A^2g_n)$. It is known that AO generates a monotone decrease of the objective and often leads to slow convergence.
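The ordering (23) can also be seen from the exact identity $(\alpha^{AO}_n)^2=\alpha^{SD}_n\,\alpha^{MG}_n$: the AO step is the geometric mean of the other two, so (23) reduces to $\alpha^{MG}_n\le\alpha^{SD}_n$, which is again the Cauchy-Schwarz inequality. A quick numerical check on a random SPD matrix (illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
A = Q @ np.diag([1.0, 2.0, 5.0, 20.0, 50.0, 100.0]) @ Q.T   # SPD matrix
g = rng.standard_normal(6)                                   # test gradient

alpha_sd = (g @ g) / (g @ A @ g)                 # Cauchy step (4)
alpha_mg = (g @ A @ g) / (g @ A @ (A @ g))       # minimal gradient step
alpha_ao = np.linalg.norm(g) / np.linalg.norm(A @ g)   # AO step
```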

We observe that the limit of AO satisfies condition (20) with equality and may potentially be improved by a cyclic breaking. For example, we can choose a shorter steplength to constantly align the gradient vector with the one-dimensional space spanned by $v_1$. Let $\tilde\alpha_n=\gamma\,\alpha^{AO}_n$ with a fixed factor $\gamma\in(0,1)$. It follows that

$$\lim_{n\to\infty}\tilde\alpha_n<\frac{2}{\lambda_1+\lambda_N}.$$

From Theorem 4, we observe that $\tilde\alpha_n$ can asymptotically trigger the alignment behavior. Hence, we can write a new gradient method, called AO with alignment (AOA), as follows:

$$\alpha^{AOA}_n=\begin{cases}\alpha^{AO}_n, & n\bmod(d_1+d_2)<d_1,\\ \tilde\alpha_t, & \text{otherwise},\end{cases}$$

with $d_1,d_2\ge1$, where $t$ denotes the last iteration at which an AO step was taken. Important differences between SDA and AOA come from the fact that the Cauchy step in SDA itself zigzags between two orthogonal directions, while the AO step in AOA converges to a constant, and this constant later leads to the same feature.
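A sketch of the AOA scheme in NumPy; the cycle lengths `d1`, `d2`, the shrinking factor `gamma`, and the test system are illustrative choices rather than values prescribed by the paper.

```python
import numpy as np

def aoa(A, b, x0, iters, d1=6, d2=4, gamma=0.9):
    """Sketch of the AOA scheme: d1 AO steps, then d2 steps that reuse the
    shortened step gamma * alpha_AO frozen at the last AO iteration."""
    x = x0.astype(float)
    frozen = None
    for n in range(iters):
        g = A @ x - b
        if np.linalg.norm(g) < 1e-13:
            break
        if n % (d1 + d2) < d1 or frozen is None:
            alpha = np.linalg.norm(g) / np.linalg.norm(A @ g)   # AO step
            frozen = gamma * alpha                              # tilde-alpha
        else:
            alpha = frozen                                      # alignment steps
        x = x - alpha * g
    return x

A = np.diag([1.0, 2.0, 5.0, 10.0])       # illustrative SPD system
b = np.array([1.0, 1.0, 1.0, 1.0])
x = aoa(A, b, np.zeros(4), iters=2000)
```

During the first `d1` iterations of each cycle the AO step is used and the shortened step is refreshed; the remaining `d2` iterations reuse the frozen value, which asymptotically lies below $2/(\lambda_1+\lambda_N)$ and hence triggers the alignment of Theorem 4.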

On the other hand, since the spectral properties of MG have been studied in Section 2, we are now prepared to propose our new methods based on them. We first introduce two auxiliary steplengths:

$$\alpha^{A2}_n=\left(\frac{1}{\alpha^{MG}_{n-1}}+\frac{1}{\alpha^{MG}_n}\right)^{-1},$$
$$\alpha^{Y2}_n=2\left(\sqrt{\left(\frac{1}{\alpha^{MG}_{n-1}}-\frac{1}{\alpha^{MG}_n}\right)^2+\frac{4\,g_n^\top Ag_n}{(\alpha^{MG}_{n-1})^2\,g_{n-1}^\top Ag_{n-1}}}+\frac{1}{\alpha^{MG}_{n-1}}+\frac{1}{\alpha^{MG}_n}\right)^{-1}.$$

Note that Y2 has been proposed in [10] as a component of the two-dimensional finite termination method.

###### Theorem 5.

Consider the linear system $Ax=b$, where $A\in\mathbb{R}^{N\times N}$ is SPD and $b\in\mathbb{R}^N$. Assume that the sequence of solution vectors $\{x_n\}$ is generated by the MG method. If $\lambda_1<\lambda_N$ and the starting point $x_0$ is such that $\zeta_{1,0}\neq0$ and $\zeta_{N,0}\neq0$, then the following results hold:

$$\lim_{n\to\infty}\alpha^{A2}_n=\frac{1}{\lambda_1+\lambda_N}, \qquad (25)$$
$$\lim_{n\to\infty}\alpha^{Y2}_n=\frac{1}{\lambda_N}, \qquad (26)$$

and

$$\lim_{n\to\infty}\left(\frac{1}{\alpha^{MG}_{n-1}\alpha^{MG}_n}-\frac{g_n^\top Ag_n}{(\alpha^{MG}_{n-1})^2\,g_{n-1}^\top Ag_{n-1}}\right)=\lambda_1\lambda_N. \qquad (27)$$
###### Proof.

The first conclusion follows immediately by combining (10) and (11). For the second argument, we have

$$\alpha^{Y2}_n=2\left(\sqrt{(\alpha^{A2}_n)^{-2}-\frac{4}{\alpha^{MG}_{n-1}\alpha^{MG}_n}+\frac{4\,g_n^\top Ag_n}{(\alpha^{MG}_{n-1})^2\,g_{n-1}^\top Ag_{n-1}}}+(\alpha^{A2}_n)^{-1}\right)^{-1}.$$

By combining (10), (11), (13) and (14), it follows that

$$\lim_{n\to\infty}\frac{g^\top_{2n+2}Ag_{2n+2}}{(\alpha^{MG}_{2n+1})^2\,g^\top_{2n+1}Ag_{2n+1}}=\lim_{n\to\infty}\frac{g^\top_{2n+1}Ag_{2n+1}}{(\alpha^{MG}_{2n})^2\,g^\top_{2n}Ag_{2n}}=\frac{\lambda_1^2c^2(\kappa-1)^2}{(1+c^2)^2}.$$

Hence, one can see that

$$\lim_{n\to\infty}\left(\frac{1}{\alpha^{MG}_{n-1}\alpha^{MG}_n}-\frac{g_n^\top Ag_n}{(\alpha^{MG}_{n-1})^2\,g_{n-1}^\top Ag_{n-1}}\right)=\frac{\lambda_1^2(1+c^2\kappa)(c^2+\kappa)}{(1+c^2)^2}-\frac{\lambda_1^2c^2(\kappa-1)^2}{(1+c^2)^2},$$

which implies (27) after some simplification. Further, along with (25), we have

$$\lim_{n\to\infty}\alpha^{Y2}_n=2\left(\sqrt{(\lambda_1+\lambda_N)^2-4\lambda_1\lambda_N}+\lambda_1+\lambda_N\right)^{-1}=\frac{1}{\lambda_N},$$

which proves (26). This completes our proof. ∎
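The limits (25) and (26) can be checked numerically by running MG on an illustrative diagonal system with $\lambda_1=1$ and $\lambda_N=20$, forming A2 and Y2 from the last two steplengths:

```python
import numpy as np

lam = np.array([1.0, 3.0, 7.0, 20.0])    # lambda_1 = 1, lambda_N = 20
A = np.diag(lam)
x = np.ones(4)
alphas, gAg = [], []
for _ in range(240):
    g = A @ x                             # b = 0
    alphas.append((g @ A @ g) / (g @ A @ (A @ g)))   # MG steplength
    gAg.append(g @ A @ g)
    x = x - alphas[-1] * g

inv1, inv0 = 1.0 / alphas[-2], 1.0 / alphas[-1]      # 1/alpha_{n-1}, 1/alpha_n
a2 = 1.0 / (inv1 + inv0)                             # A2 step
y2 = 2.0 / (np.sqrt((inv1 - inv0) ** 2
                    + 4.0 * gAg[-1] / (alphas[-2] ** 2 * gAg[-2]))
            + inv1 + inv0)                           # Y2 step
```

Here `a2` should approach $1/(\lambda_1+\lambda_N)=1/21$ and `y2` should approach $1/\lambda_N=1/20$.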

One may conclude from Theorem 5 that A2 and Y2 are similar to the auxiliary steplengths discussed in [14] and [12]. However, since MG has a shorter steplength than SD, we expect the former to be smoother than the latter. After a substitution of labels, we are able to define MG with alignment (MGA) and MG with constant steplength (MGC) as follows:

$$\alpha^{MGA}_n=\begin{cases}\alpha^{MG}_n, & n\bmod(d_1+d_2)<d_1,\\ \alpha^{A2}_t, & \text{otherwise},\end{cases}$$
$$\alpha^{MGC}_n=\begin{cases}\alpha^{MG}_n, & n\bmod(d_1+d_2)<d_1,\\ \alpha^{Y2}_t, & \text{otherwise},\end{cases}$$

with $d_1,d_2\ge1$, where $t$ denotes the last iteration at which an MG step was taken. Recall that the motivation in
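The MGA scheme above can be sketched along the same lines as AOA; the cycle lengths and the convention of freezing the A2 step built from the last two MG steplengths are illustrative choices.

```python
import numpy as np

def mga(A, b, x0, iters, d1=6, d2=4):
    """Sketch of the MGA scheme: d1 MG steps, then d2 steps that reuse the
    A2 step built from the last two MG steplengths."""
    x = x0.astype(float)
    prev_inv, frozen = None, None
    for n in range(iters):
        g = A @ x - b
        if np.linalg.norm(g) < 1e-13:
            break
        if n % (d1 + d2) < d1 or frozen is None:
            alpha = (g @ A @ g) / (g @ A @ (A @ g))       # MG step
            if prev_inv is not None:
                frozen = 1.0 / (prev_inv + 1.0 / alpha)   # A2 step, held fixed
            prev_inv = 1.0 / alpha
        else:
            alpha = frozen                                # alignment steps
        x = x - alpha * g
    return x

A = np.diag([1.0, 2.0, 5.0, 10.0])       # illustrative SPD system
b = np.array([1.0, 1.0, 1.0, 1.0])
x = mga(A, b, np.zeros(4), iters=2000)
```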