# Parametrized Accelerated Methods Free of Condition Number

Analyses of accelerated (momentum-based) gradient descent usually assume a bounded condition number to obtain exponential convergence rates. However, in many real problems, e.g., kernel methods or deep neural networks, the condition number, even locally, can be unbounded, unknown or mis-estimated. This poses problems in both implementing and analyzing accelerated algorithms. In this paper, we address this issue by proposing parametrized accelerated methods that treat the condition number as a free parameter. We provide spectral-level analysis for several important accelerated algorithms, obtain explicit expressions and improve worst case convergence rates. Moreover, we show that these algorithms converge exponentially even when the condition number is unknown or mis-estimated.


## 1 Introduction

Accelerated (momentum-based) gradient descent and its variants are arguably among the most popular optimization methods in modern machine learning. They are a workhorse of optimization for deep neural networks and achieve state-of-the-art results in a range of applications [4, 10, 20].

Momentum-based algorithms are a class of first order iterative methods which use gradient evaluations from several previous iterations. These methods can be shown to reduce the number of iterations compared to ordinary gradient descent. There is an extensive literature on analyzing such accelerated schemes, notably Nesterov’s accelerated gradient descent (Nesterov’s AGD) (see, e.g., [13, 5], and references therein).

We note that most analyses of momentum-based accelerated methods assume strong convexity (a bounded condition number $\kappa$) to obtain exponential (called linear in the optimization literature) convergence rates, i.e., an error that decays exponentially in the number of iterations $k$. Only much slower rates can be derived without that assumption [5]. Moreover, the optimal choice of parameters for these accelerated methods depends explicitly on the condition number $\kappa$.

However, in many real problems, $\kappa$ can be very large or even unbounded. For example, it can be shown that for smooth kernels $\kappa$ grows nearly exponentially with the number of data points [19, 3]. While neural networks are generally non-convex, Hessian matrices at minima appear to have many small eigenvalues, resulting in high (local) condition numbers [17]. The condition number is generally difficult to estimate. Such estimation is costly (potentially as expensive as full matrix inversion) and numerically unstable, since it requires estimating the inverse of the smallest eigenvalue of a positive definite Hessian matrix. When $\kappa$ is not known or is mis-estimated, we generally have no guarantee for the validity of momentum-based methods. Moreover, if the condition number is known but very large, the exponential theoretical rate can still be very slow, and potentially requires more computation than Newton's method.

In this paper, our primary goal is to understand performance of momentum-based algorithms and their parameter selection, when the condition number is very large or unknown. To that end, whenever parameter choice for specific algorithms depends on knowing the condition number, we propose to parametrize the algorithms by treating those parameters as “free”. These parametrized algorithms should be proven to converge, for all choices of the parameter, in order to be validated.

To be able to do that and to simplify the analysis, we consider quadratic objective functions. This is an important case which allows for a much more precise analysis than general convex functions. Moreover, even when the objective function is non-convex but smooth, as is the case for many neural networks, it can be approximated by a quadratic function near any of its local minima.

Most previous analyses focus on the worst case convergence behavior of such momentum-based algorithms. However, convergence can be much faster depending on the specific spectral properties of the Hessian. Thus we expect spectral level analysis, providing rates for each individual eigenvalue, to be much more precise. Still, to the best of our knowledge, explicit spectral level representations for these algorithms (except for the Chebyshev semi-iterative method) are not found in the literature. A recent paper [14] explored spectral-level properties of Nesterov’s AGD, but still did not give an explicit expression for the spectral-level convergence rate.

In this paper, we study and provide explicit spectral analysis for three important momentum-based accelerated methods: Nesterov’s AGD, Chebyshev semi-iterative method (Chebyshev) and Second-order Richardson method (SOR). Nesterov’s AGD is very commonly used in practice and extensively analyzed [6, 2, 18]. The classical Chebyshev semi-iterative method [11, 9] has a number of optimality properties, while SOR [8, 16], also known as the heavy ball method [15], is the simplest fixed coefficient momentum scheme. In this paper, we collectively call the set of these three methods the accelerated class.

Our Contribution.

• In this work we give explicit spectral level representations for the accelerated class methods. As far as we know, these are the first explicit expressions for Nesterov’s AGD and SOR methods. We analyze and compare their convergence rates. In particular, we show that these algorithms converge exponentially for each eigenvalue, which improves the rate obtained in [14]. We also express their worst-case convergence guarantees in terms of what we call Chebyshev numbers, which can be computed explicitly. Interestingly, we observe that Nesterov’s AGD has the slowest worst case convergence rate among the accelerated class, and that none of the algorithms accelerates convergence in all scenarios.

• We show all of the accelerated class algorithms converge even when the condition number is mis-specified. We also see how their rates of convergence depend on the choice of the parameter, corresponding to the condition number. We also provide a comparison of these methods in the “beyond the condition number” non-strictly convex regime. Additionally we show that in that regime all of the accelerated class methods converge faster than ordinary gradient descent.

Organization. The paper is organized as follows. In Section 2, we list some useful preliminaries and notation. In Section 3, we briefly introduce the accelerated class algorithms. In Section 4, we provide explicit expressions for the spectral-level convergence rates and make some basic but important observations. In Section 5, we analyze the convergence behavior of the accelerated class methods in the strongly convex setting. In Section 6, we propose parametrized methods which do not require any assumption on or knowledge of the condition number, and we analyze and compare their convergence performance. We also discuss the effect of changing the acceleration parameter. Proofs of the main theorems can be found in the Appendix.

## 2 Preliminaries and Notations

In this section, we introduce notation and some important background definitions and results, see, e.g. [1, 5, 9].

Consider the problem of minimizing a least squares objective function:

$$f(w)=\frac{1}{2}\|y-Xw\|^2,\quad w\in W, \tag{1}$$

where $W$ is a Hilbert space, $X$ is a linear operator from $W$ to another Hilbert space, and $y$ is an element of that second space. The variable $w$ is usually interpreted as a weight vector.

The Hessian operator $H=X^*X$ is positive definite, and hence $f$ is convex. The gradient of $f$ is

$$\nabla f(w)=X^*Xw-X^*y. \tag{2}$$

If $f$ is moreover strictly convex, there is a unique optimizer $w^*$.

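###### Definition 1 (Strong Convexity).

A function $f$ is $\alpha$-strongly convex if it satisfies

$$f(w)\ge f(v)+\langle\nabla f(v),w-v\rangle+\frac{\alpha}{2}\|w-v\|^2,\quad \forall w,v\in W. \tag{3}$$

###### Definition 2 (Smoothness).

A function $f$ is $\beta$-smooth if it satisfies

$$|f(w)-f(v)-\langle\nabla f(v),w-v\rangle|\le\frac{\beta}{2}\|w-v\|^2,\quad \forall w,v\in W. \tag{4}$$

When both $\alpha$-strong convexity and $\beta$-smoothness are satisfied, one can define the condition number $\kappa=\beta/\alpha$.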

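###### Definition 3 (Operator Spectrum).

Given an operator $T$ on a Hilbert space, the (operator) spectrum of $T$ is

$$\mathrm{sp}(T)=\{\mu\in\mathbb{C}\mid T-\mu I \text{ is not invertible}\}. \tag{5}$$

Every $\mu\in\mathrm{sp}(T)$ is called an eigenvalue of $T$.

###### Proposition 1.

When $T$ is self-adjoint, $\mathrm{sp}(T)\subseteq\mathbb{R}$. If $T$ is further positive definite, $\mathrm{sp}(T)\subseteq[0,\infty)$.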

###### Definition 4 (Chebyshev Polynomials).

The $k$-th Chebyshev polynomial (of the first kind), denoted by $C_k$, is defined as

$$C_k(x)=\begin{cases}\cos(k\cos^{-1}x), & \text{if } |x|\le 1,\\ \cosh(k\cosh^{-1}x), & \text{if } x>1,\\ (-1)^k\cosh(k\cosh^{-1}(-x)), & \text{if } x<-1.\end{cases} \tag{6}$$

###### Remark 1.

Note that $C_k$ is a polynomial of degree $k$.

###### Proposition 2.

Chebyshev polynomials satisfy the following recursive relation:

$$C_{k+1}(x)=2xC_k(x)-C_{k-1}(x),\quad \forall k\ge 1. \tag{7}$$
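The agreement between the closed form in Eq. (6) and the recurrence in Eq. (7) can be checked numerically. The following is a minimal Python sketch, not part of the original paper; the function names and sample points are illustrative assumptions, and the recurrence is started from $C_0=1$, $C_1=x$.

```python
import math

def cheb_closed(k, x):
    """C_k(x) from the piecewise closed form in Eq. (6)."""
    if abs(x) <= 1.0:
        return math.cos(k * math.acos(x))
    if x > 1.0:
        return math.cosh(k * math.acosh(x))
    return (-1) ** k * math.cosh(k * math.acosh(-x))

def cheb_recurrence(k, x):
    """C_k(x) from the three-term recurrence in Eq. (7), with C_0 = 1, C_1 = x."""
    prev, curr = 1.0, x
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

# Hypothetical sample points covering all three branches of Eq. (6).
points = [-1.7, -1.0, -0.3, 0.0, 0.6, 1.0, 1.4]
max_gap = max(
    abs(cheb_closed(k, x) - cheb_recurrence(k, x))
    for k in range(8)
    for x in points
)
```

Here `max_gap` stays at the level of floating-point round-off, which is what one would expect if both definitions describe the same degree-$k$ polynomial.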

###### Theorem 1.

Let $\Pi_k$ be the set of all polynomials of degree $k$ with leading coefficient 1. Then, for $k\ge 1$,

$$\min_{P_k\in\Pi_k}\max_{x\in[-1,1]}|P_k(x)| \tag{8}$$

has a unique optimizer $2^{1-k}C_k$.

## 3 Momentum-based Accelerated Methods

In this section, we introduce a few classical momentum schemes, under $\alpha$-strong convexity and $\beta$-smoothness conditions. For the moment, we also assume the condition number $\kappa$ is known.

Aiming at optimizing $f$, defined in Eq. (1), first-order iterative methods (e.g., gradient descent) use first-order (gradient) information of the objective function to iteratively approximate the optimizer $w^*$ by an approximator $w_k$.

Define the error $\xi_k=w_k-w^*$. Then the norm $\|\xi_k\|$ indicates how far we are from the optimizer in the current iteration. Moreover, the excess risk can be expressed as

$$f(w_k)-f(w^*)=\|X\xi_k\|^2. \tag{9}$$

Define the operator $B=I-\eta H$, where $I$ is the identity operator, $H$ is the Hessian, and $\eta$ is a scalar, called the step size, to be chosen. In this paper, we always set $\eta=1/\beta$, to avoid potential over-shooting issues. In addition, we introduce the parameter $\rho=1-1/\kappa$.

It is easy to see that $B$ is self-adjoint. Since the Hessian is positive definite, by Proposition 1 and the $\beta$-smoothness condition, $\mathrm{sp}(B)\subseteq[0,\rho]$. It is also important to note that an eigen-space of $B$ is also an eigen-space of the Hessian, since $B$ commutes with the Hessian $H$, with an eigenvalue correspondence:

$$\mu\longleftrightarrow 1-\mu_H/\beta, \tag{10}$$

where $\mu_H$ is the corresponding Hessian eigenvalue.

By the spectral mapping theorem, the spectrum of any polynomial $P$ in $B$ satisfies

$$\mathrm{sp}(P(B))=P(\mathrm{sp}(B))\subseteq P([0,\rho]). \tag{11}$$

Gradient Descent. The (plain) gradient descent algorithm uses full gradient information, Eq. (2), to iteratively update the approximator $w_k$, following the rule:

$$w_{k+1}=w_k-\eta\nabla f(w_k),\quad k\in\mathbb{N}. \tag{12}$$

It is not hard to see that the error satisfies

$$\xi_k=B^k\xi_0,\quad k\in\mathbb{N}. \tag{13}$$

###### Remark 2.

$B^k$ is a power of $B$, hence a polynomial of degree $k$ in $B$. In this paper, we also call plain gradient descent the power method (also known as the (first-order) Richardson or Landweber method in the literature).
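The relation $\xi_k=B^k\xi_0$ can be illustrated on a small quadratic. The sketch below is an illustration rather than the authors' code: it assumes a diagonal Hessian (so each coordinate is one eigen-space), a step size $\eta=1/\beta$, and a hypothetical spectrum, optimizer and starting point.

```python
def gradient_descent(hess_eigs, w_star, w0, num_iters):
    """Run Eq. (12) on a quadratic whose Hessian is diag(hess_eigs)."""
    beta = max(hess_eigs)          # smoothness constant = largest Hessian eigenvalue
    eta = 1.0 / beta               # step size eta = 1 / beta
    w = list(w0)
    for _ in range(num_iters):
        # For a diagonal Hessian the gradient is h_i * (w_i - w*_i).
        grad = [h * (wi - ws) for h, wi, ws in zip(hess_eigs, w, w_star)]
        w = [wi - eta * g for wi, g in zip(w, grad)]
    return w

hess_eigs = [1.0, 0.5, 0.1]        # hypothetical spectrum, condition number 10
w_star = [0.3, -1.2, 2.0]          # hypothetical optimizer
w0 = [0.0, 0.0, 0.0]
k = 25

w_k = gradient_descent(hess_eigs, w_star, w0, k)
beta = max(hess_eigs)
# Prediction from Eq. (13): component i of the error is scaled by mu_i ** k,
# where mu_i = 1 - h_i / beta is the matching eigenvalue of B.
predicted = [(1.0 - h / beta) ** k * (a - b)
             for h, a, b in zip(hess_eigs, w0, w_star)]
actual = [a - b for a, b in zip(w_k, w_star)]
max_gap = max(abs(p - a) for p, a in zip(predicted, actual))
```

The component with the smallest Hessian eigenvalue (largest $\mu_i$) dominates the remaining error, which is the slow direction that the accelerated methods below target.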

It is well known [5] that, for $\alpha$-strongly convex and $\beta$-smooth objective functions, the power method needs $O(\kappa\log(1/\epsilon))$ iterations to achieve an excess risk of $\epsilon$, while the theoretical worst case lower bound is proven to be $\Omega(\sqrt{\kappa}\log(1/\epsilon))$.

Then, it is natural to ask: with a gradient oracle, can we design a practical algorithm which uses only $k$ iterations, such that it converges faster? Or, from another point of view, how can one make the excess risk as small as possible, for every $k$?

### 3.1 Acceleration Problem

To formulate the above questions, we consider a sequence $\{P_k\}$ of real-valued polynomials, where the subscript $k$ indicates the degree of the polynomial, and let

$$\xi_k=P_k(B)\xi_0,\quad k\in\mathbb{N}. \tag{14}$$

We aim at finding a “best” choice of the sequence $\{P_k\}$, such that the excess risk is minimized for each $k$.

###### Remark 3.

The reason to consider such polynomials is that polynomials are a much richer space than monomials, as in Eq.(13), but the time complexity remains of the same order. However the memory requirements generally grow linearly with the degree of the polynomials. This can be addressed by considering polynomial families with short recurrence relations, e.g., Chebyshev polynomials.

Note that the excess risk depends on the data matrix $X$. Hence the optimal optimization method depends on the properties of the data, making different solutions optimal for different optimization problems. As we will see in Section 4, there exists no universal algorithm which is optimal for all optimization problems.

Hence, we first study a related data-independent problem. We introduce the following convenient quantity:

###### Definition 5 (Chebyshev Number).

Given a real number $\rho$, we define the Chebyshev number of a polynomial $P$ satisfying the condition $P(1)=1$ as

$$\mathrm{Ch}_\rho(P)=\max_{\lambda\in[0,\rho]}|P(\lambda)|. \tag{15}$$

As we will see below, the Chebyshev number measures the worst case convergence rate, which is data independent.

Optimization Problem. Given $\rho$, find a sequence of polynomials $\{P_k^*\}$ such that

$$P_k^*=\arg\min_{P_k}\mathrm{Ch}_\rho(P_k),\quad\text{subject to } P_k(1)=1. \tag{16}$$

###### Remark 4.

The extra condition $P_k(1)=1$ is necessary since, when $\mu=1$ (a zero Hessian eigenvalue), the gradient is zero and first-order algorithms cannot update at all.

This optimization problem can be viewed as minimizing the excess risk in the worst case scenario, assuming $\alpha$-strong convexity and $\beta$-smoothness of $f$. This is formalized by the following proposition:

###### Proposition 3.

$$f(w_k)-f(w^*)\le\mathrm{Ch}_\rho^2(P_k)\,\big(f(w_0)-f(w^*)\big), \tag{17}$$

and equality holds when the initial error $\xi_0$ lies in the eigen-space of $B$ with eigenvalue $\lambda^*$, where $\lambda^*$ is the maximizer of the corresponding Chebyshev number in Eq. (15).

Since the inequality can be an equality, $\mathrm{Ch}_\rho^2(P_k)$ is the worst case convergence rate; correspondingly, the case when $\xi_0$ lies in that eigen-space gives the slowest convergence.

### 3.2 Accelerated Class Methods

All the following algorithms explicitly rely on the assumptions of strong convexity and smoothness, and are proven [5] to use $O(\sqrt{\kappa}\log(1/\epsilon))$ iterations to reach an excess risk of $\epsilon$.

Chebyshev (Semi-Iterative) Method. By Theorem 1, the solution of the optimization problem, Eq. (16), can be shown to be unique and has the form of “normalized” Chebyshev polynomials (strictly speaking, a scaled version of the Chebyshev polynomial $C_k$) [9, 7]:

$$P_k^*(x)=\frac{C_k(x/\rho)}{C_k(1/\rho)},\quad k\in\mathbb{N}. \tag{18}$$

Combined with Eq. (14), the recursive relation, Eq. (7), of Chebyshev polynomials allows us to compute $\xi_{k+1}$, hence $w_{k+1}$, recursively without storing earlier information except $w_k$ and $w_{k-1}$. Thus, it is efficient in both time and space. The induced update rule for the weight is

$$w_{k+1}=w_k-\gamma_{k+1}\eta\nabla f(w_k)+(\gamma_{k+1}-1)(w_k-w_{k-1}),\quad k\in\mathbb{N}; \tag{19a}$$

$$w_{-1}=0, \tag{19b}$$

with the coefficients determined by

$$\gamma_{k+1}=\frac{1}{1-\rho^2\gamma_k/4},\quad\text{for } k\ge 2; \tag{20a}$$

$$\gamma_1=1,\quad\gamma_2=\frac{2}{2-\rho^2}. \tag{20b}$$
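To make the update rule in Eqs. (19) and (20) concrete, here is a hedged Python sketch on a quadratic with a diagonal Hessian. It is an illustration under stated assumptions, not the authors' implementation: the spectrum, optimizer and starting point are hypothetical, $\eta=1/\beta$, and $\rho=1-\alpha/\beta$. The check at the end compares the resulting error with the prediction $P_k^*(\mu_i)\,\xi_0^{(i)}$ from Eq. (18).

```python
def cheb(k, x):
    """Chebyshev polynomial C_k(x) via the recurrence in Eq. (7)."""
    prev, curr = 1.0, x
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

def chebyshev_gammas(rho, num_iters):
    """gamma_1, ..., gamma_{num_iters} as given in Eq. (20)."""
    gammas = []
    for j in range(1, num_iters + 1):
        if j == 1:
            gammas.append(1.0)
        elif j == 2:
            gammas.append(2.0 / (2.0 - rho ** 2))
        else:
            gammas.append(1.0 / (1.0 - rho ** 2 * gammas[-1] / 4.0))
    return gammas

def chebyshev_method(hess_eigs, w_star, w0, rho, num_iters):
    """Eq. (19) on a quadratic whose Hessian is diag(hess_eigs)."""
    eta = 1.0 / max(hess_eigs)
    w_prev = [0.0] * len(w0)   # w_{-1} = 0 as in Eq. (19b); it has no effect since gamma_1 = 1
    w = list(w0)
    for gamma in chebyshev_gammas(rho, num_iters):
        grad = [h * (wi - ws) for h, wi, ws in zip(hess_eigs, w, w_star)]
        w_next = [wi - gamma * eta * g + (gamma - 1.0) * (wi - wp)
                  for wi, wp, g in zip(w, w_prev, grad)]
        w_prev, w = w, w_next
    return w

hess_eigs = [1.0, 0.6, 0.3, 0.1]           # hypothetical spectrum
w_star = [1.0, -2.0, 0.5, 3.0]             # hypothetical optimizer
w0 = [0.0, 0.0, 0.0, 0.0]
rho = 1.0 - min(hess_eigs) / max(hess_eigs)
k = 12

w_k = chebyshev_method(hess_eigs, w_star, w0, rho, k)
mus = [1.0 - h / max(hess_eigs) for h in hess_eigs]
predicted = [cheb(k, mu / rho) / cheb(k, 1.0 / rho) * (a - b)
             for mu, a, b in zip(mus, w0, w_star)]
max_gap = max(abs(p - (a - b)) for p, a, b in zip(predicted, w_k, w_star))
```

Only the two most recent iterates are kept, which is the memory advantage of a short recurrence mentioned in Remark 3.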

Second-order Richardson Iterative Method. SOR updates following the rule:

$$w_{k+1}=w_k-c_1\eta\nabla f(w_k)+c_2(w_k-w_{k-1}),\quad k\ge 1;\qquad w_1=w_0-\eta\nabla f(w_0), \tag{21}$$

where $c_1$ and $c_2$ are time-independent coefficients. The displacement $w_k-w_{k-1}$ between the last two history records is usually interpreted as momentum.

The analysis of Frankel and Young [8, 21] suggests the following choice of coefficients (for the case of quadratic objective functions):

$$c_1=\frac{2}{1+\sqrt{1-\rho^2}}:=\gamma,\qquad c_2=\gamma-1. \tag{22}$$

With these coefficients, the errors $\xi_k$ of the iteration in Eq. (21) satisfy a recurrence relation:

$$\xi_{k+1}=\gamma B\xi_k+(1-\gamma)\xi_{k-1},\quad k\ge 1;\qquad \xi_1=B\xi_0. \tag{23}$$

It is important to note that the coefficient $\gamma_k$ in the Chebyshev method changes with time and $\gamma_k\to\gamma$ as $k\to\infty$; therefore, SOR can be viewed as the limiting case of the Chebyshev method.
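The following Python sketch, offered as an illustration under assumptions rather than as the authors' code, runs the SOR update of Eq. (21) with the coefficients of Eq. (22) on a diagonal quadratic, compares the error with the scalar recurrence of Eq. (23), and iterates the Chebyshev coefficient recurrence of Eq. (20) to observe its limit. The spectrum, optimizer and iteration counts are hypothetical.

```python
import math

def sor_gamma(rho):
    """Fixed coefficient gamma from Eq. (22)."""
    return 2.0 / (1.0 + math.sqrt(1.0 - rho ** 2))

def sor_method(hess_eigs, w_star, w0, rho, num_iters):
    """Eq. (21) with c1 = gamma, c2 = gamma - 1; requires num_iters >= 1."""
    eta = 1.0 / max(hess_eigs)
    gamma = sor_gamma(rho)

    def grad(w):
        return [h * (wi - ws) for h, wi, ws in zip(hess_eigs, w, w_star)]

    w_prev = list(w0)
    w = [wi - eta * g for wi, g in zip(w0, grad(w0))]       # w_1: plain gradient step
    for _ in range(num_iters - 1):
        g = grad(w)
        w_next = [wi - gamma * eta * gi + (gamma - 1.0) * (wi - wp)
                  for wi, wp, gi in zip(w, w_prev, g)]
        w_prev, w = w, w_next
    return w

def sor_error_factor(mu, gamma, num_iters):
    """Scalar form of Eq. (23): xi_k / xi_0 in the eigen-space with eigenvalue mu."""
    prev, curr = 1.0, mu
    for _ in range(num_iters - 1):
        prev, curr = curr, gamma * mu * curr + (1.0 - gamma) * prev
    return curr

hess_eigs = [1.0, 0.6, 0.3, 0.1]           # hypothetical spectrum
w_star = [1.0, -2.0, 0.5, 3.0]
w0 = [0.0, 0.0, 0.0, 0.0]
rho, k = 0.9, 15

w_k = sor_method(hess_eigs, w_star, w0, rho, k)
predicted = [sor_error_factor(1.0 - h, sor_gamma(rho), k) * (a - b)
             for h, a, b in zip(hess_eigs, w0, w_star)]
max_gap = max(abs(p - (a - b)) for p, a, b in zip(predicted, w_k, w_star))

# Iterate the Chebyshev coefficient recurrence, Eq. (20), and compare with gamma.
gamma_k = 2.0 / (2.0 - rho ** 2)           # gamma_2
for _ in range(200):
    gamma_k = 1.0 / (1.0 - rho ** 2 * gamma_k / 4.0)
gamma_limit_gap = abs(gamma_k - sor_gamma(rho))
```

In this example `gamma_limit_gap` is at round-off level, consistent with viewing SOR as the limiting case of the Chebyshev method.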

Nesterov’s Accelerated Gradient Descent. Introduced by Nesterov in 1983 [12], Nesterov’s AGD iteratively updates the approximator as follows (this is the constant parameter scheme) [13]:

$$w_{k+1}=u_k-\frac{1}{\beta}\nabla f(u_k), \tag{24a}$$

$$u_{k+1}=\left(1+\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)w_{k+1}-\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\,w_k. \tag{24b}$$

Defining $\gamma'=1+\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$, one can find the recurrence relation:

$$\xi_{k+1}=\gamma' B\xi_k+(1-\gamma')B\xi_{k-1},\quad k\ge 1. \tag{25}$$

###### Remark 5.

Basic algebraic computation shows that $\gamma'=\frac{2}{1+\sqrt{1-\rho}}$. Comparing with the definition of $\gamma$ in Eq. (22), one notes that the only difference is the absence of the square on $\rho$. This feature is essential and leads to different performance from the other two accelerated methods, as will be seen in Sections 4 and 5.
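As a hedged illustration of Eqs. (24) and (25), the Python sketch below runs the constant-parameter scheme on a diagonal quadratic and compares the error with the scalar recurrence. It is not the authors' code. The initialization $u_0=w_0$ (so that $\xi_1=B\xi_0$) is an assumption of this sketch, as are the spectrum and the other numerical values.

```python
import math

def nesterov_agd(hess_eigs, w_star, w0, kappa, num_iters):
    """Eq. (24) on a quadratic whose Hessian is diag(hess_eigs)."""
    beta = max(hess_eigs)
    m = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)   # momentum weight
    w = list(w0)
    u = list(w0)                   # assumed initialization u_0 = w_0
    for _ in range(num_iters):
        grad_u = [h * (ui - ws) for h, ui, ws in zip(hess_eigs, u, w_star)]
        w_next = [ui - g / beta for ui, g in zip(u, grad_u)]           # Eq. (24a)
        u = [(1.0 + m) * wn - m * wi for wn, wi in zip(w_next, w)]     # Eq. (24b)
        w = w_next
    return w

def nesterov_error_factor(mu, gamma_prime, num_iters):
    """Scalar form of Eq. (25) with xi_0 = 1 and xi_1 = mu; requires num_iters >= 1."""
    prev, curr = 1.0, mu
    for _ in range(num_iters - 1):
        prev, curr = curr, gamma_prime * mu * curr + (1.0 - gamma_prime) * mu * prev
    return curr

hess_eigs = [1.0, 0.6, 0.3, 0.1]           # hypothetical spectrum
w_star = [1.0, -2.0, 0.5, 3.0]
w0 = [0.0, 0.0, 0.0, 0.0]
kappa = max(hess_eigs) / min(hess_eigs)
rho = 1.0 - 1.0 / kappa
k = 15

gamma_prime = 1.0 + (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
identity_gap = abs(gamma_prime - 2.0 / (1.0 + math.sqrt(1.0 - rho)))   # Remark 5

w_k = nesterov_agd(hess_eigs, w_star, w0, kappa, k)
predicted = [nesterov_error_factor(1.0 - h, gamma_prime, k) * (a - b)
             for h, a, b in zip(hess_eigs, w0, w_star)]
max_gap = max(abs(p - (a - b)) for p, a, b in zip(predicted, w_k, w_star))
```

The quantity `identity_gap` checks the algebraic identity of Remark 5 numerically for this particular $\kappa$.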

By induction, one can easily show that the recurrence relations, Eqs. (23) and (25), also imply polynomial-type relations as in Eq. (14). We call the corresponding polynomials $R_k$ and $N_k$, respectively.

In the rest of this paper, we call the collection of Chebyshev, SOR and Nesterov’s AGD as the accelerated class methods/algorithms.

## 4 Spectral-level Representation

In this section, we look for explicit expressions of the polynomial $P_k$, for each member of the accelerated class. As we will see, the value $P_k(\mu)$, taken at $\mu\in\mathrm{sp}(B)$, can be interpreted as the (spectral-level) convergence rate in the corresponding eigen-space.

Spectral-level decomposition. Let $\{P_k\}$ be a sequence of real-valued polynomials. Suppose the error evolving under the algorithm obeys $\xi_k=P_k(B)\xi_0$, with $\xi_0$ being the initial error. Denote by $\{e_i\}_{i\in I}$ the eigen-basis of $B$, i.e. $Be_i=\mu_ie_i$, where $I$ is the index set.

In terms of the eigen-basis, $\xi_k$ can be decomposed as $\xi_k=\sum_{i\in I}\xi_k^{(i)}e_i$. Since the operator $P_k(B)$ commutes with $B$, each $e_i$ is also an eigen-vector of $P_k(B)$. Hence,

$$\xi_k^{(i)}:=\langle\xi_k,e_i\rangle=P_k(\mu_i)\langle\xi_0,e_i\rangle=P_k(\mu_i)\,\xi_0^{(i)}. \tag{26}$$

Firstly, we see that the eigen-components evolve independently from each other. Specifically, $\xi_k^{(i)}$ is determined by quantities only from its corresponding eigen-space: the eigen-component $\xi_0^{(i)}$ of $\xi_0$ and the scalar value of $P_k$ at the point $\mu_i$. Secondly, the spectral-level convergence rate in a particular eigen-space is measured solely by $|P_k(\mu_i)|$, with a smaller value implying faster convergence. These facts allow us to analyze the algorithms in each eigen-space independently.

Based on these observations, we can reduce the problem of analyzing the operator $P_k(B)$ to that of analyzing the scalar-valued polynomial $P_k$ on the spectrum of $B$ instead, which is a simpler problem.

Spectral-level convergence rates. To analyze the polynomials , we will derive their explicit expressions first.

Recall that we already have explicit expressions for the power and Chebyshev methods, as in Eqs. (13) and (18), respectively. But, to the best of our knowledge, explicit expressions for the SOR and Nesterov’s AGD methods (corresponding to the recurrence relations Eqs. (23) and (25)) are not found in the literature. Below we derive the explicit expressions of the polynomials $R_k$ and $N_k$.

To simplify the expressions, we introduce the following notation: $\theta=\cos^{-1}(\mu/\rho)$ and $\psi=\cos^{-1}\sqrt{\mu/\rho}$, when $\mu<\rho$; $\Theta=\cosh^{-1}(\mu/\rho)$ and $\Psi=\cosh^{-1}\sqrt{\mu/\rho}$, when $\mu>\rho$; and also $\Delta=\cosh^{-1}(1/\rho)$ and $\Lambda=\cosh^{-1}(1/\sqrt{\rho})$. By using the technique for solving linear difference equations, we can solve the recurrence relations and obtain the following theorem:

###### Theorem 2 (Explicit expressions for Rk and Nk).

If the algorithm obeys the recurrence relation in Eq. (23) or (25), then $\xi_k=R_k(B)\xi_0$ or $\xi_k=N_k(B)\xi_0$, respectively, where $R_k$ and $N_k$ have the following analytic expressions on the interval $[0,1]$:

$$R_k(\mu)=e^{-k\Delta}\times\begin{cases}\tanh\Delta\cot\theta\sin k\theta+\cos k\theta, & \mu\in[0,\rho),\\ k\tanh\Delta+1, & \mu=\rho,\\ \tanh\Delta\coth\Theta\sinh k\Theta+\cosh k\Theta, & \mu\in(\rho,1].\end{cases} \tag{27a}$$

$$N_k(\mu)=\mu^{k/2}e^{-k\Lambda}\times\begin{cases}\tanh\Lambda\cot\psi\sin k\psi+\cos k\psi, & \mu\in[0,\rho),\\ k\tanh\Lambda+1, & \mu=\rho,\\ \tanh\Lambda\coth\Psi\sinh k\Psi+\cosh k\Psi, & \mu\in(\rho,1].\end{cases} \tag{27b}$$

###### Proof.

See proof in Appendix A.1. ∎

###### Remark 6.

Although $R_k$ and $N_k$ are expressed in terms of trigonometric and hyperbolic functions and angles, one should be aware that they are polynomials of degree $k$.
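The closed forms in Eq. (27) can be compared numerically with the recurrences in Eqs. (23) and (25). The Python sketch below is illustrative only. It assumes the angle definitions $\theta=\cos^{-1}(\mu/\rho)$, $\Theta=\cosh^{-1}(\mu/\rho)$, $\Delta=\cosh^{-1}(1/\rho)$ for $R_k$, and the same expressions with $\mu$ and $\rho$ replaced by their square roots ($\psi$, $\Psi$, $\Lambda$) for $N_k$, which are what one obtains by solving the two linear difference equations; the value of $\rho$ and the sample points are hypothetical.

```python
import math

def recurrence_poly(k, mu, gamma, nesterov):
    """P_k(mu) with P_0 = 1, P_1 = mu, from Eq. (23) or, if nesterov is True, Eq. (25)."""
    prev, curr = 1.0, mu
    if k == 0:
        return prev
    for _ in range(k - 1):
        memory = (1.0 - gamma) * prev
        if nesterov:
            memory *= mu           # Eq. (25) applies B to the older error as well
        prev, curr = curr, gamma * mu * curr + memory
    return curr

def damped_factor(k, ratio, big_angle):
    """exp(-k * big_angle) times the piecewise factor appearing in Eq. (27)."""
    t = math.tanh(big_angle)
    if abs(ratio - 1.0) < 1e-12:               # boundary case mu = rho
        factor = k * t + 1.0
    elif ratio < 1.0:                          # oscillating branch, mu < rho
        a = math.acos(ratio)
        factor = t * math.cos(a) / math.sin(a) * math.sin(k * a) + math.cos(k * a)
    else:                                      # hyperbolic branch, mu > rho
        a = math.acosh(ratio)
        factor = t * math.cosh(a) / math.sinh(a) * math.sinh(k * a) + math.cosh(k * a)
    return math.exp(-k * big_angle) * factor

def sor_closed(k, mu, rho):
    """R_k(mu) from Eq. (27a)."""
    return damped_factor(k, mu / rho, math.acosh(1.0 / rho))

def nesterov_closed(k, mu, rho):
    """N_k(mu) from Eq. (27b)."""
    return mu ** (k / 2.0) * damped_factor(
        k, math.sqrt(mu / rho), math.acosh(1.0 / math.sqrt(rho)))

rho = 0.9                                      # hypothetical parameter
gamma = 2.0 / (1.0 + math.sqrt(1.0 - rho ** 2))
gamma_prime = 2.0 / (1.0 + math.sqrt(1.0 - rho))
mus = [0.0, 0.2, 0.5, rho, 0.95, 1.0]          # points in both regimes and on the boundary

sor_gap = max(abs(sor_closed(k, mu, rho) - recurrence_poly(k, mu, gamma, False))
              for k in range(13) for mu in mus)
nesterov_gap = max(abs(nesterov_closed(k, mu, rho) - recurrence_poly(k, mu, gamma_prime, True))
                   for k in range(13) for mu in mus)
```

Both gaps are at round-off level for these sample points, and both closed forms evaluate to 1 at $\mu=1$, matching the normalization $P_k(1)=1$.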

Figure 1 presents an example curve for each polynomial of the accelerated class and the power method. It should be noted that the left side of the figure, small $\mu$, corresponds to large Hessian eigenvalues, and the right side, large $\mu$, corresponds to small Hessian eigenvalues.

Strongly convex and non-strongly convex regimes. From Figure 1, one can observe the very distinct behaviours of the curves on the two sides of the vertical dashed line: on the left hand side, the curves oscillate; on the right hand side, they are monotonically increasing. Thus, we divide the spectrum space into two parts: the strongly convex regime (left side of the vertical dashed line in Figure 1), with $\mu\le\rho$, corresponding to eigen-spaces that satisfy the $\alpha$-strong convexity condition; and the non-strongly convex regime (right side of the vertical dashed line in Figure 1), with $\mu>\rho$, corresponding to eigen-spaces that break the $\alpha$-strong convexity condition. Note that this partition depends on our choice of the parameter $\rho$ (equivalently, $\kappa$).

Based on Figure 1 and Theorem 2, we observe that:

###### Observation 1.

Compared to the power method, the polynomials of the accelerated class methods tend to take larger values for small $\mu$ (left side of Figure 1) and smaller values for large $\mu$ (right side of Figure 1).

This observation indicates that the accelerated class methods converge slower than the power method in eigen-spaces with very small $\mu$, or equivalently with large Hessian eigenvalues. Then we immediately have the following important

###### Remark 7.

The accelerated class methods do not accelerate convergence for all cases, specifically for cases in which Hessian eigenvalues are concentrated near the top of the spectrum.

However, they do accelerate in the worst case scenario, as is well-known in literature. In Section 5 we will see that they also provide acceleration in the non-strongly convex regime (for very small eigenvalues of the Hessian).

Reconstruction of excess risk. With explicit expressions for these polynomials, we can reconstruct the excess risk $f(w_k)-f(w^*)$ once spectral information is given, i.e. the values of $\mu_i$ and $\xi_0^{(i)}$, or the distribution of $\xi_0^{(i)}$:

$$f(w_k)-f(w^*)=\sum_{i\in I}\beta(1-\mu_i)P_k^2(\mu_i)\big(\xi_0^{(i)}\big)^2. \tag{28}$$

Note that this excess risk is not directly computable, since $\xi_0$ will never be known. But if the distribution of $\xi_0$ is somehow given or well-approximated, we can calculate an expected excess risk

$$\mathbb{E}[f(w_k)-f(w^*)]=\sum_{i\in I}\beta(1-\mu_i)P_k^2(\mu_i)\,\mathbb{E}\big[\big(\xi_0^{(i)}\big)^2\big],$$

where the expectation is taken over the distribution of $\xi_0$. For example, if we assume the initialization is isotropic, i.e. $\mathbb{E}[(\xi_0^{(i)})^2]$ is the same for all $i$, then the expected excess risk is proportional to $\sum_{i\in I}(1-\mu_i)P_k^2(\mu_i)$.
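The spectral reconstruction can be sanity-checked on the power method, for which $P_k(\mu)=\mu^k$. The Python sketch below is an illustration under assumptions: it uses a diagonal Hessian with a hypothetical spectrum and initial error, and it compares relative excess risks (current divided by initial), so that any constant prefactor in front of the sum cancels.

```python
def relative_excess_risk_direct(hess_eigs, xi0, num_iters):
    """Run gradient descent on the error and return (f(w_k) - f*) / (f(w_0) - f*)."""
    beta = max(hess_eigs)
    xi = list(xi0)
    for _ in range(num_iters):
        xi = [x - (h / beta) * x for h, x in zip(hess_eigs, xi)]   # xi <- B xi

    def quad(e):
        # Quadratic form of the Hessian; proportional to the excess risk.
        return sum(h * x * x for h, x in zip(hess_eigs, e))

    return quad(xi) / quad(xi0)

def relative_excess_risk_spectral(mus, xi0, poly, beta=1.0):
    """Weighted sum as in Eq. (28) with P_k = poly, normalized by its k = 0 value."""
    num = sum(beta * (1.0 - mu) * poly(mu) ** 2 * x * x for mu, x in zip(mus, xi0))
    den = sum(beta * (1.0 - mu) * x * x for mu, x in zip(mus, xi0))
    return num / den

hess_eigs = [1.0, 0.6, 0.3, 0.1]           # hypothetical spectrum with beta = 1
mus = [1.0 - h / max(hess_eigs) for h in hess_eigs]
xi0 = [0.5, -1.0, 2.0, 1.5]                # hypothetical initial error
k = 20

direct = relative_excess_risk_direct(hess_eigs, xi0, k)
spectral = relative_excess_risk_spectral(mus, xi0, lambda mu: mu ** k)
# Isotropic case: equal second moments, so the weights reduce to (1 - mu_i) * P_k(mu_i)^2.
isotropic = relative_excess_risk_spectral(mus, [1.0] * len(mus), lambda mu: mu ** k)
```

Passing a different polynomial as `poly` (for instance an implementation of $R_k$ or $N_k$) would give the corresponding prediction for the accelerated methods.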

## 5 Analysis in the Strongly Convex Regime

In this section, we perform analysis on the accelerated class algorithms in the strongly convex regime, based on the explicit expressions for these algorithms. Assuming -strong convexity with would lead to the same results in this regime as the eigen-components evolve independently from each other, and their behaviors only depend on the parameter , which is determined by .

Worst case convergence rate. Recall that the spectral-level convergence rate is solely determined by the value $|P_k(\mu)|$, and that a smaller value implies faster convergence. According to the definition of the Chebyshev number, Eq. (15), and the discussion of the Chebyshev method in Section 3, we have the following claims:

###### Claim 1.

The square of the Chebyshev number, $\mathrm{Ch}_\rho^2(P_k)$, measures the worst case convergence rate under the strongly convex setting. Formally, for all $\mu\in[0,\rho]$,

$$|P_k(\mu)|\le\mathrm{Ch}_\rho(P_k),\quad\forall P_k\in\{P_k^*,R_k,N_k\},\ \forall k\ge 1. \tag{29}$$

###### Claim 2 (Optimality of the Chebyshev semi-iterative algorithm [9]).

After the same number of iterations, the Chebyshev algorithm achieves the lowest Chebyshev number among the accelerated class (and all possible first-order methods).

In the following theorem, we show that the Chebyshev number is exactly the polynomial value taken at $\mu=\rho$, which is at the boundary of the regimes. This fact is also illustrated in Figure 1, for the Chebyshev method.

###### Theorem 3 (Computation of the Chebyshev numbers).

$$\mathrm{Ch}_\rho(P_k)=P_k(\rho),\quad\forall P_k\in\{P_k^*,R_k,N_k\},\ \forall k\ge 1. \tag{30}$$

Moreover,

$$\mathrm{Ch}_\rho(P_k^*)=\frac{1}{\cosh(k\Delta)}, \tag{31a}$$

$$\mathrm{Ch}_\rho(R_k)=e^{-k\Delta}(k\tanh\Delta+1), \tag{31b}$$

$$\mathrm{Ch}_\rho(N_k)=\rho^{k/2}e^{-k\Lambda}(k\tanh\Lambda+1). \tag{31c}$$

Eq. (30) of this theorem carries two important messages for the accelerated class: (a) $\mu=\rho$ is the worst case scenario; and (b) the worst case convergence rate can be exactly computed by the value $P_k(\rho)$, where $P_k\in\{P_k^*,R_k,N_k\}$. Thus we have the explicit expressions of the Chebyshev numbers, shown in Eq. (31).
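As a numerical illustration of Theorem 3, the Python sketch below estimates each Chebyshev number by a grid maximum of $|P_k|$ over $[0,\rho]$ and compares it with the expressions in Eq. (31). It is a sketch under assumptions, not the authors' code: it takes $\Delta=\cosh^{-1}(1/\rho)$ and $\Lambda=\cosh^{-1}(1/\sqrt{\rho})$, evaluates the polynomials through their recurrences, and uses hypothetical values of $\rho$ and $k$.

```python
import math

def cheb(k, x):
    """Chebyshev polynomial C_k(x) via Eq. (7)."""
    prev, curr = 1.0, x
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

def sor_poly(k, mu, rho):
    """R_k(mu) via the scalar form of Eq. (23)."""
    gamma = 2.0 / (1.0 + math.sqrt(1.0 - rho ** 2))
    prev, curr = 1.0, mu
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, gamma * mu * curr + (1.0 - gamma) * prev
    return curr

def nesterov_poly(k, mu, rho):
    """N_k(mu) via the scalar form of Eq. (25)."""
    gamma = 2.0 / (1.0 + math.sqrt(1.0 - rho))
    prev, curr = 1.0, mu
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, gamma * mu * curr + (1.0 - gamma) * mu * prev
    return curr

def grid_max(poly, rho, n=2000):
    """Approximate Ch_rho(poly) by the maximum of |poly| on a grid of [0, rho]."""
    grid = [rho * i / n for i in range(n)] + [rho]   # include the endpoint exactly
    return max(abs(poly(mu)) for mu in grid)

rho, k = 0.9, 6                                      # hypothetical values
delta = math.acosh(1.0 / rho)
lam = math.acosh(1.0 / math.sqrt(rho))

formulas = {
    "chebyshev": 1.0 / math.cosh(k * delta),
    "sor": math.exp(-k * delta) * (k * math.tanh(delta) + 1.0),
    "nesterov": rho ** (k / 2.0) * math.exp(-k * lam) * (k * math.tanh(lam) + 1.0),
}
numeric = {
    "chebyshev": grid_max(lambda mu: cheb(k, mu / rho) / cheb(k, 1.0 / rho), rho),
    "sor": grid_max(lambda mu: sor_poly(k, mu, rho), rho),
    "nesterov": grid_max(lambda mu: nesterov_poly(k, mu, rho), rho),
}
largest_gap = max(abs(formulas[name] - numeric[name]) for name in formulas)
```

For these values the three numbers come out in the order Chebyshev, SOR, Nesterov, all below the power method's $\rho^k$.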

Comparison of algorithms. Based on the expressions of Chebyshev numbers, we compare the worst case convergence rates across algorithms, as shown below:

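###### Theorem 4 (Worst-case comparison).

The worst case convergence rates for the accelerated class algorithms satisfy, for $k\ge 2$,

$$0<\mathrm{Ch}_\rho(P_k^*)<\mathrm{Ch}_\rho(R_k)<\mathrm{Ch}_\rho(N_k)<\mathrm{Ch}_\rho(\mu^k).$$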

The above inequalities are consistent with the optimality of the Chebyshev algorithm and the fact that the accelerated class algorithms converge faster than the power method in the worst case. Moreover, we see that Nesterov’s AGD has the slowest worst-case convergence rate, among the accelerated class algorithms.

Combining Theorems 3 and 4, we get the following corollary, which recovers the known convergence rates that can be found in [5]:

###### Corollary 1.

$$\mathrm{Ch}_\rho^2(P_k^*),\ \mathrm{Ch}_\rho^2(R_k),\ \mathrm{Ch}_\rho^2(N_k)\sim O\!\left(\exp\!\left(-\frac{k}{\sqrt{\kappa}}\right)\right).$$

Theorem 4 provides a qualitative comparison. To compare quantitatively, we look at the asymptotic case. We assume the (pseudo) condition number $\kappa$ is sufficiently large; correspondingly, $\rho$ is sufficiently close to $1$. Then we have:

###### Theorem 5 (Asymptotic Analysis).

For any fixed $k$ and small enough $1-\rho$, the Chebyshev numbers can be expressed as

$$\begin{aligned}
\text{Power:}\quad & \mathrm{Ch}_\rho(\mu^k)=1-k(1-\rho)+o(1-\rho),\\
\text{Chebyshev:}\quad & \mathrm{Ch}_\rho(P_k^*)=1-k^2(1-\rho)+o(1-\rho),\\
\text{SOR:}\quad & \mathrm{Ch}_\rho(R_k)=1-k^2(1-\rho)+o(1-\rho),\\
\text{Nesterov's:}\quad & \mathrm{Ch}_\rho(N_k)=1-\tfrac{1}{2}(k^2+k)(1-\rho)+o(1-\rho).
\end{aligned}$$

The coefficients of the linear term, expressed in terms of $k$, indicate the asymptotic convergence rate. We observe that, for each accelerated class algorithm, the coefficient of the first-order term is quadratic in the number of iterations $k$, while it is only linear in $k$ for the power method. This means faster convergence and is consistent with the $\sqrt{\kappa}$ dependence of the accelerated rates, as expected.
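As a worked example, the SOR line of Theorem 5 can be recovered from Eq. (31b) by a short expansion. The derivation below is a sketch that assumes $\Delta=\cosh^{-1}(1/\rho)$, so that $\Delta\to 0$ as $\rho\to 1$; it is not quoted from the paper's appendix.

```latex
% Step 1: relate \Delta to 1-\rho, using \cosh\Delta = 1/\rho.
\cosh\Delta = 1 + \tfrac{1}{2}\Delta^2 + O(\Delta^4)
            = \frac{1}{\rho} = 1 + (1-\rho) + O\big((1-\rho)^2\big)
\;\Longrightarrow\;
\Delta^2 = 2(1-\rho) + o(1-\rho).

% Step 2: expand Eq. (31b) for fixed k, using \tanh\Delta = \Delta + O(\Delta^3).
e^{-k\Delta}\,(k\tanh\Delta + 1)
  = \Big(1 - k\Delta + \tfrac{1}{2}k^2\Delta^2 + O(\Delta^3)\Big)
    \Big(1 + k\Delta + O(\Delta^3)\Big)
  = 1 - \tfrac{1}{2}k^2\Delta^2 + O(\Delta^3).

% Step 3: substitute Step 1; note O(\Delta^3) = O\big((1-\rho)^{3/2}\big) = o(1-\rho).
\mathrm{Ch}_\rho(R_k) = 1 - k^2(1-\rho) + o(1-\rho).
```

The same two steps applied to Eq. (31a) give $1/\cosh(k\Delta)=1-\tfrac12k^2\Delta^2+O(\Delta^4)$, which explains why the Chebyshev and SOR lines agree to first order.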

Exponential spectral-level convergence rate. The following theorem states that each of the accelerated class algorithms converges exponentially in each eigen-space:

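###### Theorem 6 (Exponential Convergence).

Define $\tilde{\Delta}=\Lambda-\tfrac{1}{2}\ln\rho$; then $\tilde{\Delta}<\Delta$. Moreover, for every $\mu$ in the strongly convex regime,

$$\forall\delta\in[0,\Delta),\quad\lim_{k\to\infty}e^{k\delta}P_k^*(\mu)=\lim_{k\to\infty}e^{k\delta}R_k(\mu)=0;\qquad\forall\delta\in[0,\tilde{\Delta}),\quad\lim_{k\to\infty}e^{k\delta}N_k(\mu)=0.$$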

These exponential spectral-level convergence rates are stronger than the results obtained in [14], in which a super-polynomial convergence rate is obtained.

### 5.1 Discussion on Nesterov’s AGD

According to the polynomial expression of $N_k$ in Eq. (27b), Nesterov’s AGD seems to be a hybrid of the power method and SOR. Specifically, the term $\mu^{k/2}$ corresponds to running $k/2$ iterations of the power method, and the remaining terms correspond to running $k$ more iterations of SOR, but on a “square rooted” spectrum, i.e. $\mu\to\sqrt{\mu}$, $\rho\to\sqrt{\rho}$.

Noticing Observation 1 and the appearance of the term $\mu^{k/2}$, it is reasonable to expect that Nesterov’s AGD performs better than Chebyshev and SOR in eigen-spaces with larger Hessian eigenvalues (correspondingly smaller $\mu$), but performs worse in eigen-spaces with smaller Hessian eigenvalues (correspondingly larger $\mu$).

Slower worst case convergence rate. Although Nesterov’s AGD also has exponential convergence rates, as shown in Theorem 6, the following theorem separates it from the other two accelerated class algorithms, by showing that it has a relatively slower worst case convergence rate.

###### Theorem 7.

Let $\tilde{\Delta}$ be as in Theorem 6. Then, for all $\delta$ such that $\tilde{\Delta}<\delta<\Delta$,

$$\lim_{k\to\infty}e^{k\delta}N_k(\rho)=\infty. \tag{35}$$

This fact is illustrated more explicitly in the asymptotic case, as shown in Theorem 5. The coefficient $\tfrac{1}{2}(k^2+k)$ in place of $k^2$ gives Nesterov’s AGD a relatively larger Chebyshev number, hence slower convergence in the worst case scenario.

Therefore, we conclude that Nesterov’s AGD is not the optimal method in the sense of accelerating the worst-case scenario.

## 6 Parametrized Accelerated Methods

As pointed out in the introduction, the assumption of bounded and known condition number often does not hold in practice and can be problematic in both analysis and algorithm implementation:

Smooth kernel methods and neural networks are known to have very large or even unbounded condition numbers [3, 17]. These condition numbers are generally difficult to estimate, since the estimation is prohibitively costly and numerically unstable. When the estimation is poor, there is no theoretical guarantee for the validity of the accelerated class algorithms. Even if the condition number is known or well-estimated but very large, the exponential theoretical rate can still be very slow, and potentially requires more computation than Newton's method.

To address this issue, we propose to parametrize the accelerated class algorithms by treating $\rho$, or, equivalently, the “condition number” $\kappa$, as a free parameter.

The parametrization allows eigenvalues to appear in the non-strongly convex regime, i.e. $\mu>\rho$. We validate the parametrized accelerated class algorithms by showing that they also converge in the non-strongly convex regime, i.e. when $\mu\in(\rho,1)$. Moreover, we prove that these algorithms converge exponentially fast for each eigenvalue. Additionally, we show that in the non-strongly convex regime the accelerated class methods converge uniformly faster than ordinary gradient descent (the power method).

### 6.1 Performance in Non-strongly Convex Regime

The validity of the accelerated class algorithms in non-strongly convex regime is guaranteed by the following convergence theorem:

###### Theorem 8 (Exponential convergence).

Chebyshev, SOR, and Nesterov’s AGD converge exponentially in every eigen-space in the non-strongly convex regime, i.e.

$$\forall\delta\in[0,\Delta-\Theta),\quad\lim_{k\to\infty}e^{k\delta}P_k^*(\mu)=\lim_{k\to\infty}e^{k\delta}R_k(\mu)=0;\qquad\forall\delta\in[0,\Lambda-\Psi),\quad\lim_{k\to\infty}e^{k\delta}N_k(\mu)=0.$$

###### Remark 8.

Since both $\Theta$ and $\Psi$ depend on $\mu$, the spectral-level convergence rates also depend on $\mu$, with smaller $\mu$ (correspondingly larger Hessian eigenvalue) having a relatively faster convergence rate.

Compared to the exponential spectral-level convergence in the strongly convex regime, as in Theorem 6, this exponential convergence is not uniform on this regime, since the range of valid $\delta$ shrinks to 0 as $\mu\to 1$.

Comparison of algorithms. We also compare the performance of these accelerated class algorithms in the non-strongly convex regime.

###### Theorem 9 (Comparison of algorithms).

In the non-strongly convex regime, i.e. $\mu\in(\rho,1)$, we have, for $k\ge 2$,

$$\text{(a)}:\ 0<P_k^*(\mu)<R_k(\mu)<\mu^k;\qquad\text{(b)}:\ 0<N_k(\mu)<\mu^k.$$

###### Remark 9.

Recall that $\mu^k$ is the polynomial expression of the power method (ordinary gradient descent), which we list here for comparison.

Part (a) of Theorem 9 gives an ordering of the Chebyshev, SOR and power methods in the non-strongly convex regime. Part (b) shows that Nesterov’s AGD also converges faster than the power method in this regime. From the theorem, we get the following message: in the non-strongly convex regime, the accelerated class algorithms always converge faster than the power method (ordinary gradient descent).
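The comparison with the power method can be inspected numerically at a few points with $\mu>\rho$. The Python sketch below is illustrative and not taken from the paper: it evaluates $P_k^*$ through Eq. (18) and $R_k$, $N_k$ through their recurrences, Eqs. (23) and (25), for a hypothetical $\rho$ and a handful of sample values of $\mu$ and $k$.

```python
import math

def cheb(k, x):
    """Chebyshev polynomial C_k(x) via Eq. (7)."""
    prev, curr = 1.0, x
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

def three_term(k, mu, gamma, nesterov):
    """R_k(mu) from Eq. (23), or N_k(mu) from Eq. (25) when nesterov is True."""
    prev, curr = 1.0, mu
    if k == 0:
        return prev
    for _ in range(k - 1):
        memory = (1.0 - gamma) * prev
        if nesterov:
            memory *= mu
        prev, curr = curr, gamma * mu * curr + memory
    return curr

rho = 0.9                                          # hypothetical parameter
gamma = 2.0 / (1.0 + math.sqrt(1.0 - rho ** 2))    # SOR coefficient, Eq. (22)
gamma_prime = 2.0 / (1.0 + math.sqrt(1.0 - rho))   # Nesterov coefficient, Remark 5

rows = []
for mu in (0.92, 0.95, 0.99):                      # sample points with mu > rho
    for k in (2, 5, 10):
        rows.append({
            "mu": mu,
            "k": k,
            "chebyshev": cheb(k, mu / rho) / cheb(k, 1.0 / rho),
            "sor": three_term(k, mu, gamma, False),
            "nesterov": three_term(k, mu, gamma_prime, True),
            "power": mu ** k,
        })

accelerated_beats_power = all(
    0.0 < r["chebyshev"] < r["sor"] < r["power"] and 0.0 < r["nesterov"] < r["power"]
    for r in rows
)
```

At these sample points the Chebyshev value is the smallest, the SOR value lies between it and $\mu^k$, and the Nesterov value is also below $\mu^k$; the sketch makes no claim about the relative order of Nesterov’s AGD and the other two accelerated methods.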

Figure 2 briefly illustrates the results of Theorem 9.

We currently do not have direct comparison of Nesterov’s AGD with Chebyshev and SOR methods, but based on Theorem 8, it is reasonable to conjecture that Nesterov’s AGD, at least asymptotically, converges slower than the other two methods.

### 6.2 Choosing Different Acceleration Parameters

Noting that different choices of the acceleration parameter $\rho$ result in different polynomials, we use a superscript $[i]$ to distinguish them.

###### Theorem 10 (Effect of choosing different parameters).

Let $\rho_1<\rho_2$. Then

$$\mathrm{Ch}_{\rho_1}\big(P_k^{[1]}\big)<\mathrm{Ch}_{\rho_2}\big(P_k^{[2]}\big);\qquad\forall\mu>\rho_2,\quad P_k^{[1]}(\mu)>P_k^{[2]}(\mu).$$

Figure 3 illustrates this theorem; see the caption for details.

Loosely speaking, this theorem states that a smaller $\rho$ tends to: (a) accelerate the convergence in the strongly convex regime, by lowering the corresponding Chebyshev number; and (b) slow down convergence in the non-strongly convex regime. However, readers should be aware that changing the parameter $\rho$ will also change the partition of the regimes. This effect is also shown in Figure 3.
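The trade-off described above can be seen on the Chebyshev polynomial of Eq. (18). The Python sketch below is a small illustration with hypothetical parameter values; it uses the fact from Theorem 3 that the Chebyshev number equals the polynomial value at $\mu=\rho$.

```python
def cheb(k, x):
    """Chebyshev polynomial C_k(x) via Eq. (7)."""
    prev, curr = 1.0, x
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, curr = curr, 2.0 * x * curr - prev
    return curr

def chebyshev_poly(k, mu, rho):
    """P*_k(mu) from Eq. (18) for acceleration parameter rho."""
    return cheb(k, mu / rho) / cheb(k, 1.0 / rho)

rho_small, rho_large, k = 0.8, 0.9, 5      # hypothetical parameters, rho_small < rho_large

# Worst case over the strongly convex regime: the value at mu = rho (Theorem 3).
cheb_number_small = chebyshev_poly(k, rho_small, rho_small)
cheb_number_large = chebyshev_poly(k, rho_large, rho_large)

# An eigenvalue beyond both parameters, i.e. in the non-strongly convex regime for both.
mu = 0.95
value_small = chebyshev_poly(k, mu, rho_small)
value_large = chebyshev_poly(k, mu, rho_large)
```

With these numbers the smaller parameter has the smaller Chebyshev number, but the larger value at $\mu=0.95$, so the gain inside the strongly convex regime is paid for outside it.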

## References

• [1] Naum I Achieser. Theory of approximation. Courier Corporation, 2013.
• [2] Yossi Arjevani, Shai Shalev-Shwartz, and Ohad Shamir. On lower and upper bounds for smooth and strongly convex optimization problems. arXiv preprint arXiv:1503.06833, 2015.
• [3] M. Belkin. Approximation beats concentration? An approximation view on inference with smooth radial kernels. ArXiv e-prints, January 2018.
• [4] Yoshua Bengio, Nicolas Boulanger-Lewandowski, and Razvan Pascanu. Advances in optimizing recurrent networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8624–8628. IEEE, 2013.
• [5] Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
• [6] Sébastien Bubeck, Yin Tat Lee, and Mohit Singh. A geometric alternative to Nesterov’s accelerated gradient descent. arXiv preprint arXiv:1506.08187, 2015.
• [7] Donald A Flanders and George Shortley. Numerical determination of fundamental modes. Journal of Applied Physics, 21(12):1326–1332, 1950.
• [8] Stanley P Frankel. Convergence rates of iterative treatments of partial differential equations. Mathematical Tables and Other Aids to Computation, 4(30):65–75, 1950.
• [9] Gene H Golub and Richard S Varga. Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik, 3(1):147–156, 1961.
• [10] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
• [11] Cornelius Lanczos. Solution of systems of linear equations by minimized iterations. J. Res. Nat. Bur. Standards, 49(1):33–53, 1952.
• [12] Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pages 543–547, 1983.
• [13] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
• [14] Andreas Neubauer. On Nesterov acceleration for Landweber iteration of linear ill-posed problems. Journal of Inverse and Ill-posed Problems, 25(3):381–390, 2017.
• [15] Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
• [16] James D Riley. Iteration procedures for the Dirichlet difference problem. Mathematical Tables and Other Aids to Computation, 8(47):125–131, 1954.
• [17] L. Sagun, L. Bottou, and Y. LeCun. Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond. ArXiv e-prints, November 2016.
• [18] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
• [19] Holger Wendland. Scattered data approximation, volume 17. Cambridge University Press, 2004.
• [20] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
• [21] David Young. Iterative methods for solving partial difference equations of elliptic type. Transactions of the American Mathematical Society, 76(1):92–111, 1954.

## Appendix A: Proofs of Theorems

###### Lemma 1.
$$\forall\theta\in[0,\pi/2],\quad |\sin k\theta| \le k\sin\theta,\quad k\in\mathbb{N}. \tag{36}$$
###### Proof.

Obviously, the lemma hold for . In the following, we assume .

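First consider the interval $[0, \frac{\pi}{2k}]$:

For all $\theta\in[0,\frac{\pi}{2k}]$, both $\cos k\theta$ and $\cos\theta$ are nonnegative, and $\cos k\theta \le \cos\theta$, because of the monotonicity of $\cos$ on $[0,\pi/2]$ and $k\theta \ge \theta$. Since

$$(\sin k\theta)' = k\cos k\theta, \qquad (k\sin\theta)' = k\cos\theta, \tag{37}$$

then $(\sin k\theta)' \le (k\sin\theta)'$. Combining this with the fact that

$$\sin k\theta\,\big|_{\theta=0} = k\sin\theta\,\big|_{\theta=0} = 0, \tag{38}$$

and that $\sin k\theta \ge 0$ on this interval, one can conclude that

$$\forall\theta\in\left[0,\tfrac{\pi}{2k}\right],\quad |\sin k\theta| \le k\sin\theta. \tag{39}$$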

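Then, we consider the interval $[\frac{\pi}{2k}, \frac{\pi}{2}]$:

For all $\theta$ in this interval, we have

$$|\sin k\theta| \le 1 = \sin\left(k\cdot\frac{\pi}{2k}\right) \le k\sin\frac{\pi}{2k} \le k\sin\theta, \tag{40}$$

where we used Eq.(39) for the second inequality and the monotonicity of $\sin$ on $[0,\pi/2]$ for the last inequality.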

Hence, we conclude the lemma. ∎
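As a quick numerical sanity check of Lemma 1 (not a substitute for the proof), the inequality can be evaluated on a grid. This is a minimal sketch; the function name `lemma1_gap` and the grid resolution are illustrative choices, not part of the paper.

```python
import numpy as np


def lemma1_gap(k, num_points=2001):
    """Smallest value of k*sin(theta) - |sin(k*theta)| over a grid on [0, pi/2].

    Lemma 1 states that this quantity is nonnegative for every theta
    in [0, pi/2], so the returned minimum should be >= 0 up to rounding.
    """
    theta = np.linspace(0.0, np.pi / 2, num_points)
    return float(np.min(k * np.sin(theta) - np.abs(np.sin(k * theta))))


if __name__ == "__main__":
    for k in range(1, 11):
        print(k, lemma1_gap(k))
```

The minimum is attained at $\theta = 0$, where both sides vanish, so the printed gaps are expected to be zero up to floating-point error.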

### A.1 Proof of Theorem 2

###### Proof.

To solve the recurrence relations, we follow the technique for solving linear difference equations.

Second order Richardson case. The corresponding recurrence relation is Eq.(23):

$$\xi_{k+1} = \gamma B\xi_k + (1-\gamma)\xi_{k-1},\quad k\ge 1;\qquad \xi_1 = B\xi_0.$$

Now, we define auxiliary polynomials $Q_k(B)$ which satisfy

$$Q_{k+1}(B) = \gamma B\,Q_k(B) + (1-\gamma)\,Q_{k-1}(B),\quad k\ge 1;\qquad Q_1(B) = \gamma B;\quad Q_0(B) = I. \tag{41}$$

Note that, unlike in Eq.(23), we set $Q_1(B) = \gamma B$ instead of $B$.

By induction, one can easily verify that $Q_k(B)$ is a polynomial in $B$ of degree $k$, and that

$$\xi_k = Q_{k-1}(B)\,\xi_1 + (1-\gamma)\,Q_{k-2}(B)\,\xi_0,\quad k\ge 2. \tag{42}$$
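The identity in Eq.(42) can be checked numerically by running the recurrence of Eq.(23) for the iterates and the recurrence of Eq.(41) for the auxiliary polynomials side by side. The sketch below uses a random symmetric matrix as a stand-in for $B$; the matrix, the value of $\gamma$, and the function name `check_eq42` are assumptions made for illustration only.

```python
import numpy as np


def check_eq42(n=6, gamma=1.3, steps=12, seed=0):
    """Largest entrywise deviation between xi_k and the right-hand side of Eq.(42)."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    B = (A + A.T) / (2 * n)          # arbitrary symmetric test operator (assumption)
    xi0 = rng.standard_normal(n)

    xi = [xi0, B @ xi0]              # xi_1 = B xi_0, as in Eq.(23)
    Q = [np.eye(n), gamma * B]       # Q_0 = I, Q_1 = gamma * B, as in Eq.(41)
    for k in range(1, steps):
        xi.append(gamma * (B @ xi[k]) + (1 - gamma) * xi[k - 1])
        Q.append(gamma * (B @ Q[k]) + (1 - gamma) * Q[k - 1])

    err = 0.0
    for k in range(2, steps + 1):
        rhs = Q[k - 1] @ xi[1] + (1 - gamma) * (Q[k - 2] @ xi0)
        err = max(err, float(np.max(np.abs(xi[k] - rhs))))
    return err


if __name__ == "__main__":
    print(check_eq42())
```

The returned deviation should be at the level of floating-point rounding for any choice of seed and of $\gamma$.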

Replace the operator $B$ in Eq.(41) by a scalar variable $x$, and then apply the standard technique for solving linear difference equations: consider $q_k(x)$ as the $k$-th power of $q(x)$. Then

$$q_{k+1}(x) = \gamma x\,q_k(x) + (1-\gamma)\,q_{k-1}(x),\quad k\ge 1, \tag{43}$$

and

$$q_1(x) = \gamma x;\qquad q_0(x) = 1. \tag{44}$$

Eq.(43) reduces to the following quadratic equation

$$q^2(x) = \gamma x\,q(x) + (1-\gamma), \tag{45}$$

which has two roots $q_\pm(x)$. The general solution is

$$Q_k(x) = c_1\,q_+^k(x) + c_2\,q_-^k(x), \tag{46}$$

where the coefficients $c_1$ and $c_2$ are determined by the initial condition Eq.(44).

For this particular case: when , the roots are complex, and have the form , where ; when , are real and have the form , where .
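For reference, the two cases can be made explicit with the quadratic formula. The following is a sketch under two assumptions that are not restated in this appendix: that $\gamma > 1$, and that $\rho$ coincides with the point $2\sqrt{\gamma-1}/\gamma$ where the discriminant of Eq.(45) vanishes. The angle parametrization is one that is consistent with the closed form of $Q_k$; the paper's own definitions are those of Section 4.

$$
q_\pm(x) = \frac{\gamma x \pm \sqrt{\gamma^2 x^2 - 4(\gamma-1)}}{2}, \qquad q_+(x)\,q_-(x) = \gamma - 1 .
$$

$$
\gamma^2 x^2 < 4(\gamma-1):\quad q_\pm(x) = \sqrt{\gamma-1}\,e^{\pm i\theta},\quad \cos\theta = \frac{\gamma x}{2\sqrt{\gamma-1}};
\qquad
\gamma^2 x^2 > 4(\gamma-1):\quad q_\pm(x) = \sqrt{\gamma-1}\,e^{\pm\Theta},\quad \cosh\Theta = \frac{\gamma x}{2\sqrt{\gamma-1}} .
$$

In both cases the geometric mean of the two roots is $\sqrt{\gamma-1}$, which is where a prefactor of the form $(\gamma-1)^{k/2}$ comes from.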

Then, after algebraic manipulations, we have

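$$Q_k(x) = (\gamma-1)^{k/2}\begin{cases}\dfrac{\sin(k+1)\theta}{\sin\theta} & \text{if } 0\le x<\rho,\\[6pt] k+1 & \text{if } x=\rho,\\[6pt] \dfrac{\sinh(k+1)\Theta}{\sinh\Theta} & \text{if } \rho < x.\end{cases} \tag{47}$$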

Using Eq.(42) and noting that $\xi_1 = B\xi_0$, after some algebraic manipulation, we obtain the expression for $R_k$ as shown in the theorem.
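The closed form of $Q_k(x)$ for the second-order Richardson case can be compared against the scalar recurrence of Eq.(43) with the initial conditions of Eq.(44). This is a minimal sketch under stated assumptions: $\gamma\in(1,2)$, $\rho$ is taken to be $2\sqrt{\gamma-1}/\gamma$ (the point where the discriminant of Eq.(45) vanishes), and the angles are defined by $\cos\theta = x/\rho$ and $\cosh\Theta = x/\rho$. The function names are hypothetical.

```python
import numpy as np


def q_recurrence(x, gamma, k):
    """Q_k(x) from the recurrence of Eq.(43) with Q_0 = 1, Q_1 = gamma * x."""
    q_prev, q_cur = 1.0, gamma * x
    if k == 0:
        return q_prev
    for _ in range(k - 1):
        q_prev, q_cur = q_cur, gamma * x * q_cur + (1 - gamma) * q_prev
    return q_cur


def q_closed_form(x, gamma, k):
    """Trigonometric / hyperbolic closed form of Q_k(x), under the assumptions above."""
    rho = 2 * np.sqrt(gamma - 1) / gamma      # assumed relation between rho and gamma
    scale = (gamma - 1) ** (k / 2)
    if np.isclose(x, rho):
        return scale * (k + 1)
    if x < rho:
        theta = np.arccos(x / rho)
        return scale * np.sin((k + 1) * theta) / np.sin(theta)
    big_theta = np.arccosh(x / rho)
    return scale * np.sinh((k + 1) * big_theta) / np.sinh(big_theta)


if __name__ == "__main__":
    gamma = 1.5
    for x in (0.0, 0.3, 0.7, 1.0):
        print(x, q_recurrence(x, gamma, 8), q_closed_form(x, gamma, 8))
```

The two columns printed for each `x` should agree to rounding error, on both sides of the threshold $\rho$.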

Nesterov’s AGD case. The proof for Nesterov’s AGD is closely analogous to that of second-order Richardson, so we omit the repeated steps. In this case, the auxiliary polynomials satisfy

$$Q_{k+1}(B) = \gamma' B\,Q_k(B) + (1-\gamma')\,B\,Q_{k-1}(B),\quad k\ge 1;\qquad Q_1(B) = \gamma' B;\quad Q_0(B) = I. \tag{48}$$

Note the additional factor $B$ in the $Q_{k-1}(B)$ term, and the differently defined parameter $\gamma'$.

Therefore, $q(x)$ now satisfies, instead of Eq.(45),

$$q^2(x) = \gamma' x\,q(x) + (1-\gamma')\,x. \tag{49}$$

Then, $Q_k(x)$ is in turn

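$$Q_k(x) = (\gamma'-1)^{k/2}\,x^{k/2}\begin{cases}\dfrac{\sin(k+1)\psi}{\sin\psi} & \text{if } 0\le x<\rho,\\[6pt] k+1 & \text{if } x=\rho,\\[6pt] \dfrac{\sinh(k+1)\Psi}{\sinh\Psi} & \text{if } \rho < x,\end{cases} \tag{50}$$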

where and , as defined in Section 4.

By induction, we can show that in this case

$$\xi_k = Q_{k-1}(B)\,\xi_1 + (1-\gamma')\,B\,Q_{k-2}(B)\,\xi_0,\quad k\ge 2. \tag{51}$$

Noting that $\xi_1 = B\xi_0$, we obtain the expression for $N_k$ as shown in the theorem. ∎
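The Nesterov closed form in Eq.(50) can be checked in the same spirit against the scalar version of the recurrence in Eq.(48). The sketch assumes $\gamma'\in(1,2)$, takes $\rho = 4(\gamma'-1)/\gamma'^2$ (the point where the discriminant of Eq.(49) vanishes), and uses $\cos\psi = \sqrt{x/\rho}$ and $\cosh\Psi = \sqrt{x/\rho}$; these relations are derived from Eq.(49) here and stand in for the Section 4 definitions, which are not reproduced in this appendix. Function names are hypothetical.

```python
import numpy as np


def nesterov_q_recurrence(x, gamma_p, k):
    """Q_k(x) from the scalar form of Eq.(48): note the extra factor x on Q_{k-1}."""
    q_prev, q_cur = 1.0, gamma_p * x
    if k == 0:
        return q_prev
    for _ in range(k - 1):
        q_prev, q_cur = q_cur, gamma_p * x * q_cur + (1 - gamma_p) * x * q_prev
    return q_cur


def nesterov_q_closed_form(x, gamma_p, k):
    """Closed form of Eq.(50), under the assumed definitions of rho, psi and Psi."""
    rho = 4 * (gamma_p - 1) / gamma_p**2      # assumed relation between rho and gamma'
    scale = ((gamma_p - 1) * x) ** (k / 2)
    if np.isclose(x, rho):
        return scale * (k + 1)
    if x < rho:
        psi = np.arccos(np.sqrt(x / rho))
        return scale * np.sin((k + 1) * psi) / np.sin(psi)
    big_psi = np.arccosh(np.sqrt(x / rho))
    return scale * np.sinh((k + 1) * big_psi) / np.sinh(big_psi)


if __name__ == "__main__":
    gamma_p = 1.6
    for x in (0.0, 0.2, 0.5, 1.0):
        print(x, nesterov_q_recurrence(x, gamma_p, 8), nesterov_q_closed_form(x, gamma_p, 8))
```

Compared with the second-order Richardson case, the only structural change is the factor $x^{k/2}$, which comes from the product of the roots of Eq.(49) being $(\gamma'-1)x$ rather than $\gamma-1$.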

### A.2 Proof of Theorem 3

###### Proof.

It is enough to show that,

$$\forall\mu\in[0,\rho],\quad |P_k(\mu)| \le P_k(\rho). \tag{52}$$

We prove this case by case:

Chebyshev. Note that . According to Eq.(6) and (18),

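$$|P_k(\mu)| = \frac{\left|\cos\left(k\cos^{-1}(\mu/\rho)\right)\right|}{\cosh\left(k\cosh^{-1}(1/\rho)\right)} \le \frac{1}{\cosh\left(k\cosh^{-1}(1/\rho)\right)} = P_k(\rho). \tag{53}$$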

Second-order Richardson. Since and , then . According to Eq.(27a),

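$$
\begin{aligned}
|R_k(\mu)| &= \rho\exp(-k\Delta)\cdot\left|\sinh\Delta\,\frac{\cos\theta\,\sin k\theta}{\sin\theta} + \cosh\Delta\,\cos k\theta\right| \\
&\le \rho\exp(-k\Delta)\cdot\left(\left|\sinh\Delta\,\frac{\cos\theta\,\sin k\theta}{\sin\theta}\right| + \left|\cosh\Delta\,\cos k\theta\right|\right) \\
&\le \rho\exp(-k\Delta)\cdot\left(\sinh\Delta\,\frac{|\sin k\theta|}{\sin\theta} + \cosh\Delta\right) \\
&\le \rho\exp(-k\Delta)\cdot\left(k\sinh\Delta + \cosh\Delta\right),
\end{aligned} \tag{54}
$$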

where the last inequality holds true because of Lemma 1.

On the other hand, when $\mu = \rho$, the angle $\theta = 0$, thus

$$R_k(\rho) = \rho\exp(-k\Delta)\left(k\sinh\Delta + \cosh\Delta\right). \tag{55}$$

Combining the above two equations, we conclude the theorem for second-order Richardson case.

Nesterov’s AGD. This argument is similar to the second-order Richardson case. The angle $\psi$ is in the interval $[0, \pi/2]$. According to Eq.(27b),

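$$
\begin{aligned}
|N_k(\mu)| &= \mu^{k/2}\sqrt{\rho}\,\exp(-k\Lambda)\cdot\left|\sinh\Lambda\,\frac{\cos\psi\,\sin k\psi}{\sin\psi} + \cosh\Lambda\,\cos k\psi\right| \\
&\le \mu^{k/2}\sqrt{\rho}\,\exp(-k\Lambda)\cdot\left(\left|\sinh\Lambda\,\frac{\cos\psi\,\sin k\psi}{\sin\psi}\right| + \left|\cosh\Lambda\,\cos k\psi\right|\right) \\
&\le \mu^{k/2}\sqrt{\rho}\,\exp(-k\Lambda)\cdot\left(\sinh\Lambda\,\left|\frac{\sin k\psi}{\sin\psi}\right| + \cosh\Lambda\right) \\
&\le \rho^{k/2}\sqrt{\rho}\,\exp(-k\Lambda)\cdot\left(k\sinh\Lambda + \cosh\Lambda\right) = N_k(\rho),
\end{aligned} \tag{56}
$$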

where we applied Lemma 1 again in the last inequality. ∎
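The statement of Eq.(52) for the second-order Richardson and Nesterov cases can also be probed numerically, using only the three-term recurrences rather than the closed forms. This is a sketch under assumptions: the acceleration parameter lies in $(1,2)$, both recurrences start from $P_0 = 1$ and $P_1(\mu) = \mu$ (that is, $\xi_1 = B\xi_0$), and $\rho$ is tied to the parameter through the vanishing discriminant of Eq.(45) or Eq.(49). The function names are illustrative.

```python
import numpy as np


def residual_polynomial(mu, gamma, k, nesterov=False):
    """Evaluate P_k(mu) by the three-term recurrence, with P_0 = 1 and P_1 = mu."""
    mu = np.asarray(mu, dtype=float)
    p_prev, p_cur = np.ones_like(mu), mu.copy()
    if k == 0:
        return p_prev
    extra = mu if nesterov else 1.0   # Nesterov carries an extra factor mu on P_{k-1}
    for _ in range(k - 1):
        p_prev, p_cur = p_cur, gamma * mu * p_cur + (1 - gamma) * extra * p_prev
    return p_cur


def max_violation(gamma, k, nesterov=False, num_points=4001):
    """max over a grid on [0, rho] of |P_k(mu)|, minus P_k(rho); should be <= 0."""
    if nesterov:
        rho = 4 * (gamma - 1) / gamma**2      # assumed relation for Nesterov's AGD
    else:
        rho = 2 * np.sqrt(gamma - 1) / gamma  # assumed relation for Richardson
    mu = np.linspace(0.0, rho, num_points)
    values = residual_polynomial(mu, gamma, k, nesterov)
    return float(np.max(np.abs(values)) - values[-1])


if __name__ == "__main__":
    for method in (False, True):
        print(method, max(max_violation(1.5, k, method) for k in range(1, 21)))
```

A result of zero (up to rounding) indicates that the maximum of $|P_k|$ over the grid on $[0,\rho]$ is attained at $\mu=\rho$, in line with the theorem.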

We show that