Fast Gradient Methods with Alignment for Symmetric Linear Systems without Using Cauchy Step

by   Qinmeng Zou, et al.

The performance of gradient methods has been considerably improved by the introduction of delayed parameters. After two and a half decades, the revealing of second-order information has recently given rise to Cauchy-based methods with alignment, which asymptotically reduce the search space to smaller and smaller dimensions. They are generally considered the state of the art of gradient methods. This paper reveals the spectral properties of the minimal gradient and asymptotically optimal steps, and then suggests three fast methods with alignment that do not use the Cauchy step. Convergence results are provided, and numerical experiments show that the new methods provide competitive and more stable alternatives to the classical Cauchy-based methods. In particular, alignment gradient methods present advantages over Krylov subspace methods in some situations, which makes them attractive in practice.




1 Introduction

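Consider the linear system

\[ Ax = b, \tag{1} \]

where $A \in \mathbb{R}^{N \times N}$ is symmetric positive definite (SPD) and $b \in \mathbb{R}^N$. The solution $x_*$ is the unique global minimizer of the strictly convex quadratic function

\[ f(x) = \frac{1}{2} x^T A x - b^T x. \tag{2} \]

The gradient method is of the form

\[ x_{n+1} = x_n - \alpha_n g_n, \tag{3} \]

where $g_n = \nabla f(x_n) = A x_n - b$. The steepest descent (SD) method, originally proposed in [4], defines the steplength as the reciprocal of a Rayleigh quotient of the Hessian matrix

\[ \alpha_n^{SD} = \frac{g_n^T g_n}{g_n^T A g_n}, \]

which is also called the Cauchy steplength. It minimizes the function $f$ or, equivalently, the $A$-norm of the error, and thus gives a theoretically optimal result in each step

\[ \alpha_n^{SD} = \arg\min_{\alpha} f(x_n - \alpha g_n) = \arg\min_{\alpha} \| (I - \alpha A) e_n \|_A^2, \]

where $e_n = x_* - x_n$. This classical method is known to behave badly in practice: the directions generated tend to asymptotically alternate between two orthogonal directions, leading to slow convergence [1].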
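The behavior described above can be observed on a small example. The following is a minimal sketch, assuming the Cauchy steplength is the usual quotient $g_n^T g_n / (g_n^T A g_n)$ and taking $A$ diagonal, which loses no generality for gradient methods up to an orthogonal change of variables. The test matrix, the right-hand side and all function names are illustrative choices, not taken from the paper.

```python
import math

def gradient(eigs, b, x):
    """g = A x - b for the diagonal matrix A = diag(eigs)."""
    return [lam * xi - bi for lam, xi, bi in zip(eigs, x, b)]

def objective(eigs, b, x):
    """f(x) = 1/2 x^T A x - b^T x for A = diag(eigs)."""
    return sum(0.5 * lam * xi * xi - bi * xi for lam, xi, bi in zip(eigs, x, b))

def cauchy_steplength(eigs, g):
    """Reciprocal of the Rayleigh quotient of A at g."""
    gg = sum(gi * gi for gi in g)
    gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
    return gg / gAg

def cosine(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    return dot / math.sqrt(sum(ui * ui for ui in u) * sum(vi * vi for vi in v))

def sd_run(eigs, b, x0, iters):
    """Run `iters` steepest descent steps; record gradients and f-values."""
    x = list(x0)
    grads, fvals = [], []
    for _ in range(iters):
        g = gradient(eigs, b, x)
        grads.append(g)
        fvals.append(objective(eigs, b, x))
        alpha = cauchy_steplength(eigs, g)
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x, grads, fvals

# Illustrative ill-conditioned problem (condition number 100).
sd_eigs = [1.0, 4.0, 10.0, 100.0]
x_sd, sd_grads, sd_fvals = sd_run(sd_eigs, [1.0] * 4, [0.0] * 4, 30)

# Cosines between g_n and g_{n+2}; values approaching 1 indicate the zigzag.
sd_zigzag = [cosine(sd_grads[n], sd_grads[n + 2]) for n in range(len(sd_grads) - 2)]
```

Consecutive gradients are orthogonal by construction of the Cauchy step, and $f$ decreases monotonically; inspecting `sd_zigzag` gives a rough picture of how quickly the two alternating directions emerge for this particular problem.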

The first gradient method with retards is the Barzilai-Borwein (BB) method, originally proposed in [3]. The BB method is of the form

\[ \alpha_n^{BB} = \frac{g_{n-1}^T g_{n-1}}{g_{n-1}^T A g_{n-1}}, \]

which remedies the convergence issue for ill-conditioned problems by using a nonmonotone steplength. The motivation arose from providing a two-point approximation to the quasi-Newton methods, namely

\[ \alpha_n^{BB} = \arg\min_{\alpha} \| \alpha^{-1} s_{n-1} - y_{n-1} \|^2, \]

where $s_{n-1} = x_n - x_{n-1}$ and $y_{n-1} = g_n - g_{n-1}$. Notice that $\alpha_n^{BB} = \alpha_{n-1}^{SD}$. There exists a similar method developed by symmetry in [3]

\[ \alpha_n^{BB2} = \frac{g_{n-1}^T A g_{n-1}}{g_{n-1}^T A^2 g_{n-1}}, \]

which imposes as well a quasi-Newton property

\[ \alpha_n^{BB2} = \arg\min_{\alpha} \| s_{n-1} - \alpha y_{n-1} \|^2. \]

We remark that $\alpha_n^{BB2} = \alpha_{n-1}^{MG}$, see Section 2. Practical experience is generally in favor of BB. The convergence analysis of these methods was given in [29] and [7]. The preconditioned version was established in [25]. A more recent chapter by [15] discussed the efficiency of BB. In the years that followed, numerous generalizations have appeared, such as alternate methods [5, 9], cyclic methods [18, 5, 6], adaptive methods [36, 17], and some general frameworks [18, 5, 35].
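As an illustration of the delayed steplength, here is a minimal sketch of BB on a diagonal SPD system. It assumes that the BB steplength at iteration $n$ is the Cauchy steplength evaluated at $g_{n-1}$ and, as a further assumption not stated in the text, that the very first step is a plain Cauchy step. The helper `bb_from_secant` computes the same quantity from the secant pair $(s_{n-1}, y_{n-1})$ as $s^T s / (s^T y)$. All names and the test problem are hypothetical.

```python
import math

def bb_gradient(eigs, b, x):
    """g = A x - b for the diagonal matrix A = diag(eigs)."""
    return [lam * xi - bi for lam, xi, bi in zip(eigs, x, b)]

def bb_cauchy(eigs, g):
    """Cauchy steplength g^T g / g^T A g."""
    gg = sum(gi * gi for gi in g)
    gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
    return gg / gAg

def bb_from_secant(s, y):
    """Two-point (secant) form of the BB steplength: s^T s / s^T y."""
    return sum(si * si for si in s) / sum(si * yi for si, yi in zip(s, y))

def bb_run(eigs, b, x0, tol=1e-8, max_iters=1000):
    """BB iteration; returns the last iterate and the gradient-norm history."""
    x = list(x0)
    g = bb_gradient(eigs, b, x)
    prev_cauchy = None          # Cauchy steplength of the previous gradient
    norms = []
    for _ in range(max_iters):
        gnorm = math.sqrt(sum(gi * gi for gi in g))
        norms.append(gnorm)
        if gnorm <= tol:
            break
        current_cauchy = bb_cauchy(eigs, g)
        # Assumption: the first step falls back to the Cauchy steplength.
        alpha = current_cauchy if prev_cauchy is None else prev_cauchy
        prev_cauchy = current_cauchy
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
        g = bb_gradient(eigs, b, x)
    return x, norms

bb_eigs = [1.0, 4.0, 10.0, 100.0]
bb_b = [1.0] * 4
x_bb, bb_norms = bb_run(bb_eigs, bb_b, [0.0] * 4)
```

The history `bb_norms` is in general not monotone, which is the nonmonotone behavior referred to above.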

There exist several auxiliary steplengths that act as accelerators of other methods. More precisely, occasionally performing such auxiliary iterative steps can often improve the global convergence. For example, in order to find the unique minimizer in finitely many iterations in two dimensions, [34] proposed an ingenious steplength as follows

\[ \alpha_n^{Y} = 2 \left( \sqrt{ \left( \frac{1}{\alpha_{n-1}^{SD}} - \frac{1}{\alpha_n^{SD}} \right)^2 + \frac{4 \|g_n\|^2}{\left( \alpha_{n-1}^{SD} \|g_{n-1}\| \right)^2} } + \frac{1}{\alpha_{n-1}^{SD}} + \frac{1}{\alpha_n^{SD}} \right)^{-1}, \]

which is called the Yuan steplength. Recently, [14] proposed a new gradient method that also exploits the spectral properties of SD. The improvement resorts to a special steplength

\[ \alpha_n^{A} = \left( \frac{1}{\alpha_{n-1}^{SD}} + \frac{1}{\alpha_n^{SD}} \right)^{-1}. \]

In one direction, these steplengths give rise to some efficient gradient methods. For example, [10] provided several alternate steps, of which we mention here the second variant

which seems to be the most promising variant according to the experiments. As usual, it does not have a specific name. Here we call it the Dai-Yuan (DY) method [17]. A closer examination of the Yuan variants revealed that they have a distinguishing property called “decreasing together” [10]. It means that DY does not sink into any lower-dimensional subspace spanned by eigenvectors. Experiments have shown that BB also has such a feature. An important difference is that BB is a nonmonotone steplength, whereas DY is monotone and thus more stable.
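The finite termination property of the Yuan steplength in two dimensions can be checked numerically. The sketch below assumes the form of the Yuan steplength commonly quoted in the literature, built from two consecutive Cauchy steplengths and the gradient norms $\|g_{n-1}\|$ and $\|g_n\|$, where $g_n$ is obtained from $g_{n-1}$ by a Cauchy step. The test problem and the function names are hypothetical.

```python
import math

def y_gradient(eigs, b, x):
    """g = A x - b for the diagonal matrix A = diag(eigs)."""
    return [lam * xi - bi for lam, xi, bi in zip(eigs, x, b)]

def y_cauchy(eigs, g):
    """Cauchy steplength g^T g / g^T A g."""
    gg = sum(gi * gi for gi in g)
    gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
    return gg / gAg

def yuan_steplength(eigs, g_prev, g):
    """Assumed Yuan steplength; g must follow g_prev by one Cauchy step."""
    a_prev = y_cauchy(eigs, g_prev)
    a_curr = y_cauchy(eigs, g)
    gp2 = sum(gi * gi for gi in g_prev)
    gc2 = sum(gi * gi for gi in g)
    root = math.sqrt((1.0 / a_prev - 1.0 / a_curr) ** 2
                     + 4.0 * gc2 / (a_prev ** 2 * gp2))
    return 2.0 / (root + 1.0 / a_prev + 1.0 / a_curr)

def step(x, g, alpha):
    """One gradient step x - alpha * g."""
    return [xi - alpha * gi for xi, gi in zip(x, g)]

# Two-dimensional illustrative problem.
y_eigs, y_b = [1.0, 10.0], [1.0, 1.0]
y_x0 = [0.0, 0.0]
y_g0 = y_gradient(y_eigs, y_b, y_x0)
y_x1 = step(y_x0, y_g0, y_cauchy(y_eigs, y_g0))      # Cauchy step
y_g1 = y_gradient(y_eigs, y_b, y_x1)
alpha_yuan = yuan_steplength(y_eigs, y_g0, y_g1)
y_x2 = step(y_x1, y_g1, alpha_yuan)                  # Yuan step
y_g2 = y_gradient(y_eigs, y_b, y_x2)
y_x3 = step(y_x2, y_g2, y_cauchy(y_eigs, y_g2))      # final Cauchy step
y_g3 = y_gradient(y_eigs, y_b, y_x3)
```

In this two-dimensional setting the Yuan steplength equals the reciprocal of the largest eigenvalue, so the step removes the corresponding gradient component and the following Cauchy step lands on the minimizer up to rounding.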

On the other hand, the auxiliary steps lead to gradient methods with alignment such as

with . This method is called steepest descent with alignment (SDA). Here, we choose the version described in [13] without using the switch condition illustrated in [14], and vary the form while leaving the alignment property unchanged. Shortly after, they presented another similar method based on Yuan steplength [12], called steepest descent with constant steplength (SDC) which is of the form

with . The main feature of this method is to selectively foster the reduction of the gradient components along the eigenvectors of $A$, and thus to reduce the search space to smaller and smaller dimensions. As a result, the problem tends to have a better and better condition number [12]. We note that the motivations of SDA and SDC are different according to [14] and [12]. Since their derivations both involve the spectral analysis of the Cauchy step, we regard both of them as alignment methods here. These two methods seem to be the state of the art of gradient methods and tend to give the best performance among all of the above. Recently, [19] introduced a general framework of Cauchy steplength with alignment, which breaks the Cauchy cycle by periodically applying some short steplengths.

Despite the good practical performance of alignment methods, all promising formulations are based on the Cauchy steplength in order to ensure the alignment feature. It is desirable to relax this restriction and step outside of this framework. In this paper, we address this issue and investigate some gradient methods that have the alignment property without using the Cauchy steplength. In Section 2, we analyze the spectral properties of the minimal gradient step. In Section 3, we introduce some new gradient methods by virtue of the basic steplengths and discuss their alignment property. In Section 4, we focus on the convergence analysis of the new methods. A set of numerical experiments is presented in Section 5, and concluding remarks are drawn in Section 6.

2 Spectral analysis of minimal gradient

The minimal gradient (MG) method, proposed in [23], is of the form

\[ \alpha_n^{MG} = \frac{g_n^T A g_n}{g_n^T A^2 g_n}. \]

It minimizes the $2$-norm of the gradient

\[ \alpha_n^{MG} = \arg\min_{\alpha} \| g_n - \alpha A g_n \|, \]

where $\|\cdot\|$ denotes the Euclidean norm of a vector. Traditionally it does not have a specific name. From [22] we know that it was originally called “minimal residues”. However, this term might cause confusion since there exists a Krylov subspace method called MINRES [27], which minimizes the norm of the residual through the Lanczos process. On the other hand, MG is also a special case of the Orthomin($k$) method when $k = 1$ [20], and is thus sometimes called OM [2, 33]. Here, the name “minimal gradient” comes from [9], since it gives an optimal gradient result in each step.
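The minimizing property of the MG step can be verified directly. The following sketch assumes the MG steplength is $g^T A g / (g^T A^2 g)$ and compares the norm of the next gradient $(I - \alpha A) g$ for the MG step, the Cauchy step and slightly perturbed MG steps. The matrix is taken diagonal for simplicity, and the vector and names are illustrative.

```python
import math

def mg_steplength(eigs, g):
    """MG steplength g^T A g / g^T A^2 g for A = diag(eigs)."""
    gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
    gA2g = sum(lam * lam * gi * gi for lam, gi in zip(eigs, g))
    return gAg / gA2g

def sd_steplength(eigs, g):
    """Cauchy steplength g^T g / g^T A g."""
    gg = sum(gi * gi for gi in g)
    gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
    return gg / gAg

def next_gradient_norm(eigs, g, alpha):
    """Euclidean norm of (I - alpha A) g."""
    return math.sqrt(sum(((1.0 - alpha * lam) * gi) ** 2
                         for lam, gi in zip(eigs, g)))

mg_eigs = [1.0, 4.0, 10.0, 100.0]
mg_g = [1.0, -2.0, 0.5, 3.0]
a_mg = mg_steplength(mg_eigs, mg_g)
a_sd = sd_steplength(mg_eigs, mg_g)
norm_mg = next_gradient_norm(mg_eigs, mg_g, a_mg)
norm_sd = next_gradient_norm(mg_eigs, mg_g, a_sd)
```

For this vector the MG step is shorter than the Cauchy step and yields the smaller gradient norm, as expected from the definition.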

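We can assume without loss of generality that

\[ 0 < \lambda_1 \le \lambda_2 \le \dots \le \lambda_N, \]

where $\{\lambda_1, \dots, \lambda_N\}$ is the set of eigenvalues of $A$, and $\{v_1, \dots, v_N\}$ is the set of associated eigenvectors. Let $\kappa$ be the condition number of $A$ such that

\[ \kappa = \frac{\lambda_N}{\lambda_1}. \]

From (3) we can deduce that

\[ g_{n+1} = g_n - \alpha_n A g_n. \tag{6} \]

There exist real numbers $\zeta_{i,n}$ such that

\[ g_n = \sum_{i=1}^{N} \zeta_{i,n} v_i. \tag{7} \]

Then, substituting (7) into (6) implies

\[ g_{n+1} = \sum_{i=1}^{N} (1 - \alpha_n \lambda_i) \zeta_{i,n} v_i. \]

We know from [1] that the SD method is asymptotically reduced to a search in the two-dimensional subspace generated by the two eigenvectors corresponding to the largest and the smallest eigenvalues of $A$. Eventually the directions generated tend to zigzag between two orthogonal directions, which gives rise to a slow convergence rate. This argument was established by using the following lemma; see [1] and [16] for more details.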

Lemma 1.


be a probability measure attached to

where and . Consider a transformation such that



for some .

We now give our main result on the spectral properties of MG. These arguments lead to the gradient methods with alignment which shall be described in Section 3.

Theorem 2.

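Consider the linear system $Ax = b$ where $A$ is SPD and $b \in \mathbb{R}^N$. Assume that the sequence of solution vectors $\{x_n\}$ is generated by the MG method. If $0 < \lambda_1 < \dots < \lambda_N$ and the starting point $x_0$ is such that $g_0^T v_1 \neq 0$ and $g_0^T v_N \neq 0$, then for some constant $c$, the following results hold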

  (a) relations (8) and (9);
  (b) relations (10) and (11);
  (c) relation (12);
  (d) relations (13) and (14).

We first prove (8) and (9). We have

Together with (7), this implies that

For any and , let us write , it follows that


Moreover, we define a probability measure


from which we notice that . Hence,

Notice that in Lemma 1 can be expressed as without loss of generality. Substituting (15) and applying again (16), it follows that

Along with Lemma 1 the desired result follows.

For the argument (b), notice that

Since argument (a) has been proved, relations (10) and (11) trivially follow by applying (8) and (9).

Then we prove the argument (c). For any , it follows from (6) that

Combining (8) and (10) implies

After some simplification, we obtain (12) when the iteration number in the denominator is even. In an analogous fashion, combining (9) and (11) yields

One finds that the final result of the odd case converges also to the same limit, which is the desired conclusion.

Finally, for the argument (d), we can similarly combine (8) and (10), which implies

Repeating this process for another case by using (9) and (11) yields

After some simplification, we can obtain (13) and (14). This completes our proof. ∎


The assumption used in Theorem 2 is not restrictive: if there exist some repeated eigenvalues, then we can choose the corresponding eigenvectors so that the superfluous ones vanish [15]. Moreover, if $g_0^T v_1$ or $g_0^T v_N$ equals zero, then the second condition can simply be replaced by the components involving inner indices, without changing the results discussed later on.

Note that argument (a) in Theorem 2 has been proved in [28] for a framework called $P$-gradient algorithms, while results (b) to (d) for the MG method have not appeared in the literature. Result (b) shows that MG also exhibits the zigzag behavior, namely, it alternates between two directions. The implications of Theorem 2 shall be seen later in Section 3. For now, we give the asymptotic behavior of the quadratic function for completeness.

Theorem 3.

Under the assumptions of Theorem 2, the following results hold




For any , it follows from (2) that

Let us write as defined in (15) and (16), in which case we obtain

If $n$ is an even number, from (8), one finds that

Notice that

which yields the first equation. Similarly, if $n$ is an odd number, it follows that

The numerator can be merged as follows

which yields the second result. Finally, (19) follows immediately by combining (17), (18) and (12). This completes our proof. ∎
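The zigzag behavior of MG discussed in this section can be explored numerically. The sketch below iterates the normalized gradient under the MG step, assuming the steplength $g^T A g / (g^T A^2 g)$ and a diagonal $A$. Two facts are exact and easy to check: consecutive gradients are $A$-conjugate, and in two dimensions $g_{n+2}$ is parallel to $g_n$. The helper `middle_share`, the test spectra and the starting vectors are hypothetical.

```python
import math

def normalize(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    nrm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / nrm for vi in v]

def mg_directions(eigs, g0, iters):
    """Normalized gradients g_0, g_1, ... generated by the MG step."""
    g = normalize(g0)
    dirs = [g]
    for _ in range(iters):
        gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
        gA2g = sum(lam * lam * gi * gi for lam, gi in zip(eigs, g))
        alpha = gAg / gA2g                      # MG steplength
        g = normalize([(1.0 - alpha * lam) * gi for lam, gi in zip(eigs, g)])
        dirs.append(g)
    return dirs

def middle_share(g):
    """Squared mass of a unit vector outside its first and last component."""
    return sum(gi * gi for gi in g[1:-1])

def a_inner(eigs, u, v):
    """Inner product u^T A v for A = diag(eigs)."""
    return sum(lam * ui * vi for lam, ui, vi in zip(eigs, u, v))

# Two dimensions: the zigzag is exact.
dirs_2d = mg_directions([1.0, 10.0], [1.0, 1.0], 10)

# Four dimensions: the middle components are expected to fade over time.
zz_eigs = [1.0, 2.0, 5.0, 10.0]
dirs_4d = mg_directions(zz_eigs, [1.0, 1.0, 1.0, 1.0], 50)
shares_4d = [middle_share(g) for g in dirs_4d]
```

Inspecting `shares_4d` gives an empirical view of how the search concentrates on the two extreme eigenvectors for this particular starting point; the speed of this process depends on the spectrum and on the initial gradient.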

3 New alignment methods without Cauchy steplength

As far as we know, all existing gradient methods with alignment are based on the Cauchy steplength. After a further rearrangement of steps, [19] concluded that one can break the Cauchy cycle by periodically applying some short steplengths to accelerate the convergence of gradient methods. We show here that the Cauchy step is not necessary for this: several methods that potentially possess the same feature can be derived without it.

[14] observed that a constant steplength equal to $1/(\lambda_1 + \lambda_N)$ could lead to the alignment property. Here we extend it to a more general case.

Theorem 4.

Consider the linear system (1) and the gradient method (3) with a positive constant steplength $\alpha$ such that

\[ \alpha \le \frac{2}{\lambda_1 + \lambda_N} \tag{20} \]

being used to solve (1). Then the sequence $\{x_n\}$ converges to $x_*$ for any starting point $x_0$. Moreover, if equality holds, then




We have

By [30], it is easy to deduce that the sequence $\{x_n\}$ converges to $x_*$ with a steplength $\alpha < 2/\lambda_N$. Hence, the first statement holds. One finds that


For (22) to be satisfied, one needs to impose the condition for all , which yields

The second one is obviously satisfied, while the first one leads to

If equality holds, then

It is clear that . Then the second statement trivially follows, which completes the proof. ∎

Note that leads to the trivial case , and thus the limit in both (21) and (22) equals . From Theorem 4 we find that condition (20) has a twofold effect: it drives the alignment property when the strict inequality holds, as shown in (22), and it forces the search into a two-dimensional space in the case of equality, as shown in (21). This means that if some steps asymptotically make the equality case (21) attainable, then the method has a tendency similar to that of the SD method, namely, alternating between two orthogonal directions. On the other hand, we can add a fractional factor to periodically break the cycle. This asymptotically yields a constant steplength strictly smaller than $2/(\lambda_1 + \lambda_N)$, leading to an alignment process in the subsequent iterations according to (22).
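The twofold effect of a constant steplength can be illustrated as follows. This is a sketch under the assumption that condition (20) bounds the steplength by $2/(\lambda_1 + \lambda_N)$, which is our reading of the surrounding discussion. The gradient is renormalized at every step so that only its direction is tracked; the spectrum, the factor 0.9 and the names are illustrative.

```python
import math

def unit(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    nrm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / nrm for vi in v]

def constant_step_direction(eigs, g0, alpha, iters):
    """Direction of the gradient after `iters` steps g <- (I - alpha A) g."""
    g = unit(g0)
    for _ in range(iters):
        g = unit([(1.0 - alpha * lam) * gi for lam, gi in zip(eigs, g)])
    return g

cs_eigs = [1.0, 2.0, 5.0, 10.0]
cs_bound = 2.0 / (cs_eigs[0] + cs_eigs[-1])   # assumed bound in condition (20)
cs_g0 = [1.0, 1.0, 1.0, 1.0]

# Strict inequality: the gradient aligns with the first eigenvector.
g_strict = constant_step_direction(cs_eigs, cs_g0, 0.9 * cs_bound, 200)

# Equality: the gradient settles in the span of the first and last eigenvectors.
g_equal = constant_step_direction(cs_eigs, cs_g0, cs_bound, 200)
```

With the shorter steplength the factor $|1 - \alpha \lambda_1|$ dominates all others, so every other component decays relative to the first one. With equality, the factors for $\lambda_1$ and $\lambda_N$ have the same magnitude and opposite signs, which produces the alternation between two directions.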

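Recall that [8] proposed a gradient method of the form

\[ \alpha_n^{AO} = \frac{\|g_n\|}{\|A g_n\|}. \]

It asymptotically converges to the optimal steplength

\[ \alpha^{OPT} = \frac{2}{\lambda_1 + \lambda_N}, \]

which minimizes the norm of the coefficient matrix

\[ \alpha^{OPT} = \arg\min_{\alpha} \| I - \alpha A \|. \]

Thus we call it the asymptotically optimal (AO) method. Notice that the following relationship holds

\[ \alpha_n^{MG} \le \alpha_n^{AO} \le \alpha_n^{SD}, \]

which can be easily proved by the Cauchy-Schwarz inequality

\[ g_n^T A g_n \le \|g_n\| \, \|A g_n\|. \]

It is known that AO generates a monotone curve and often leads to slow convergence.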
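The ordering of the three basic steplengths can be checked along an AO run. The sketch assumes the AO steplength is $\|g_n\| / \|A g_n\|$, in which case it is the geometric mean of the MG and Cauchy steplengths, and takes $A$ diagonal. The spectrum, the starting gradient and the names are illustrative.

```python
import math

def basic_steplengths(eigs, g):
    """Return (MG, AO, SD) steplengths at g for A = diag(eigs)."""
    gg = sum(gi * gi for gi in g)
    gAg = sum(lam * gi * gi for lam, gi in zip(eigs, g))
    gA2g = sum(lam * lam * gi * gi for lam, gi in zip(eigs, g))
    mg = gAg / gA2g
    ao = math.sqrt(gg / gA2g)       # ||g|| / ||A g||
    sd = gg / gAg
    return mg, ao, sd

def ao_history(eigs, g0, iters):
    """Steplength triples recorded along an AO iteration on the gradient."""
    nrm = math.sqrt(sum(gi * gi for gi in g0))
    g = [gi / nrm for gi in g0]
    history = []
    for _ in range(iters):
        mg, ao, sd = basic_steplengths(eigs, g)
        history.append((mg, ao, sd))
        g = [(1.0 - ao * lam) * gi for lam, gi in zip(eigs, g)]
        # Renormalize: steplengths depend only on the direction of g.
        nrm = math.sqrt(sum(gi * gi for gi in g))
        g = [gi / nrm for gi in g]
    return history

ao_eigs = [1.0, 2.0, 5.0, 10.0]
ao_hist = ao_history(ao_eigs, [1.0, 1.0, 1.0, 1.0], 50)
ao_target = 2.0 / (ao_eigs[0] + ao_eigs[-1])   # limit stated in the text
```

Comparing the AO entries of `ao_hist` with `ao_target` gives an empirical view of the asymptotic behavior described above for this particular problem.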

We observe that the limit of AO satisfies condition (22) and may potentially be improved by a cyclic breaking. For example, we can choose a shorter steplength to constantly align the gradient vector to the one-dimensional space spanned by $v_1$. Let where . It follows that

From Theorem 4, we observe that can asymptotically trigger the alignment behavior. Hence, we can write a new gradient method called AO with alignment (AOA) as follows


with . Important differences between SDA and AOA come from the fact that the Cauchy step in SDA zigzags itself in two orthogonal directions, while the AO step in AOA converges to a constant and the constant leads later to the same feature.

On the other hand, since the spectral properties of MG have been studied in Section 2, we are now prepared to propose our new methods based on them. We first give some notations

Note that Y2 has been proposed in [10] as a component of the 2-dimensional finite termination method.

Theorem 5.

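Consider the linear system $Ax = b$ where $A$ is SPD and $b \in \mathbb{R}^N$. Assume that the sequence of solution vectors $\{x_n\}$ is generated by the MG method. If $0 < \lambda_1 < \dots < \lambda_N$ and the starting point $x_0$ is such that $g_0^T v_1 \neq 0$ and $g_0^T v_N \neq 0$, then the following results hold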




The first conclusion follows immediately by combining (10) and (11). For the second argument, we have

By combining (10), (11), (13) and (14), it follows that

Hence, one can see that

which implies the second conclusion after some simplification. Further, along with (25), we have

This completes our proof. ∎

One may conclude from Theorem 5 that A2 and Y2 are similar to the auxiliary steplengths discussed in [14] and [12]. However, since MG has a shorter steplength than SD, we expect that the former might be smoother than the latter. After a substitution of labels, we are able to define MG with alignment (MGA) and MG with constant steplength (MGC) as follows


with . Recall that the motivation in