Average-case Acceleration Through Spectral Density Estimation

02/12/2020 · by Fabian Pedregosa, et al.

We develop a framework for designing optimal quadratic optimization methods in terms of their average-case runtime. This yields a new class of methods that achieve acceleration through a model of the Hessian's expected spectral density. We develop explicit algorithms for the uniform, Marchenko-Pastur, and exponential distributions. These methods are momentum-based gradient algorithms whose hyper-parameters can be estimated without knowledge of the Hessian's smallest singular value, in contrast with classical accelerated methods like Nesterov acceleration and Polyak momentum. Empirical results on quadratic, logistic regression and neural networks show the proposed methods always match and in many cases significantly improve over classical accelerated methods.


1 Introduction

Existing analyses of optimization algorithms focus on the worst case (nemirovski1995information; nesterov2004introductory). Given any input from a function class, no matter how unlikely, this type of analysis provides a complexity bound. This can lead to misleading results, where the worst-case running time is much worse than the running time observed in practice.

Average-case analysis instead provides the expected complexity of an algorithm over a class of problems, and is more representative of the typical behavior of the algorithm. Contrary to the more classical worst-case analysis, which only requires knowledge of the largest and smallest eigenvalues of the Hessian, the average-case analysis requires a more fine-grained knowledge of the spectrum. In particular, it requires knowledge of the probability density function of a random eigenvalue, also known as the expected spectral density.

Not much attention has been given to the average-case analysis of optimization algorithms due to its dependency on the Hessian spectrum. However, a blessing of the high-dimensional regime is that the spectrum becomes surprisingly predictable. For random matrices, this has been well known since the seminal work of 10.2307/1970079, who used it to model nuclei of heavy atoms. Recent work has shown that the spectrum of large deep learning models is equally predictable and that it can be approximated using classical models from random matrix theory like the Marchenko-Pastur distribution (sagun2017empirical; martin2018implicit). As the parameters of this distribution can be estimated through the empirical moments (which can themselves be estimated from traces of powers of the Hessian), this opens the door to the development of new algorithms with better average-case convergence.

Figure 1: A worst-case analysis can lead to misleading results, where the worst-case running time is much worse than the observed running time. Above: convergence of individual random least-squares problems (gray), while the average suboptimality (orange) is well below the worst-case bound (blue).

In this work we make the following main contributions:

  1. We derive a framework to construct the average-case optimal gradient-based method.

  2. We propose a family of practical algorithms using different models of the expected spectral density: Marchenko-Pastur, uniform, and exponential.

  3. We present experiments showing the performance of these methods on synthetic and real datasets. These show that the methods are highly competitive with traditional accelerated methods and, unlike the latter, do not require knowledge of the strong convexity constant.

1.1 Related work

We comment on two of the main ideas behind the proposed methods: polynomial-based iterative methods and spectral density estimation.

Polynomial-based iterative methods.

Our work draws heavily from the classical framework of polynomial-based iterative methods (fischer1996polynomial), which originated with the Chebyshev iterative method of flanders1950numerical and was later instrumental in the development of the celebrated conjugate gradient method (hestenes1952methods). Recently, this framework has been used to derive accelerated gossip algorithms (berthier2018accelerated) and accelerated algorithms for smooth games (azizian2020accelerating), to name a few. Although originally developed for the worst-case analysis of optimization algorithms, in this work we extend this framework to also analyze average-case runtime.

Spectral density estimation.

The realization that deep learning networks behave as linear models in the infinite-width limit (jacot2018neural; novak2018bayesian) has sparked a renewed interest in the study of the spectral density of large matrices. ghorbani2019investigation developed efficient tools for estimating the empirical spectrum of very large models, and jacot2019asymptotic; pennington2017nonlinear derived new models for the spectral density of neural networks, to name a few.

Notation.

Throughout the paper we denote vectors by lowercase boldface letters and matrices by uppercase boldface letters. Probability density functions and eigenvalues are written in Greek letters, while polynomials are written in uppercase Latin letters. We will often omit the integration variable when it is clear from context.

2 Average-case Analysis

In this section we derive the framework needed to discuss average-case optimal algorithms. The main result of this section is Theorem 2.1, which relates the expected error to other quantities that are easier to manipulate, such as the (to be defined) residual polynomial. This will allow us, in the next section, to pose the problem of finding an optimal method as a best approximation problem in the space of polynomials.

Let be a symmetric positive-definite matrix and , both sampled from a probability space . We consider the quadratic minimization problem

(OPT)

and will be interested in quantifying the expected error , where is the -th update of a first-order method starting from and is the expectation over the random .

Remark 1

Problem (OPT) subsumes the quadratic minimization problem but the notation above will be more convenient for our purposes.

Remark 2

The expected error is over the inputs and not over any randomness of the algorithm, as is common in the stochastic literature. In this paper we will only consider deterministic algorithms.

To solve (OPT), we will consider first-order methods. These are methods in which the sequence of iterates is in the span of previous gradients, i.e.,

(1)

This class of algorithms includes gradient descent and momentum, but not, for example, quasi-Newton methods, since the preconditioner would allow the iterates to leave the span. Furthermore, we will only consider oblivious methods, that is, methods in which the coefficients of the update are known in advance and do not depend on previous updates. This leaves out methods such as conjugate gradient, which do not generalize beyond quadratics.

There is an intimate link between first-order methods and polynomials that simplifies the analysis of gradient-based methods. The following proposition showcases this link by relating the error at iteration  to the error at initialization and the residual polynomial.

Proposition 2.1

(hestenes1952methods) Let  be generated by a first-order method. Then there exists a polynomial of degree  that verifies

(2)

Following (fischer1996polynomial), we will refer to this polynomial as the residual polynomial.

Example 1

(Gradient descent). In the case of gradient descent, the residual polynomial has a remarkably simple form. Consider the update . Subtracting on both sides gives

(3)
(4)

and so the residual polynomial is .
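To make this concrete, here is a minimal sketch checking Example 1 numerically, assuming the quadratic f(x) = (1/2)(x − x*)ᵀH(x − x*) and the gradient-descent residual polynomial P_t(λ) = (1 − hλ)^t for a step size h (these symbols are our own placeholders for the elided quantities):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t = 50, 20
A = rng.standard_normal((d, d))
H = A @ A.T / d                         # random positive semi-definite Hessian
x_star = rng.standard_normal(d)         # minimizer of f(x) = 1/2 (x - x*)^T H (x - x*)
x0 = rng.standard_normal(d)

h = 1.0 / np.linalg.eigvalsh(H).max()   # gradient-descent step size
x = x0.copy()
for _ in range(t):
    x = x - h * H @ (x - x_star)        # gradient step; grad f(x) = H (x - x*)

# Residual-polynomial identity: x_t - x* = (I - h H)^t (x0 - x*)
pred = np.linalg.matrix_power(np.eye(d) - h * H, t) @ (x0 - x_star)
print(np.allclose(x - x_star, pred))    # True
```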

A convenient way to collect statistics on the problem is through the empirical spectral density. Let be the eigenvalues of . We define the empirical spectral density as

(5)

where  denotes a Dirac delta, i.e., the function equal to zero everywhere except at , and whose integral over the entire real line is equal to one.

We will now need a tool to collect information about the typical spectrum for matrices in our probability space. For this, we will use the expected spectral density, which corresponds to the law of a random eigenvalue from a random matrix . For notational convenience, we will denote it :

(6)
Example 2 (Marchenko-Pastur distribution)

Consider a matrix , where each entry is an i.i.d. random variable with mean zero and variance . Then it is known that the expected spectral distribution of  converges to the Marchenko-Pastur distribution (marchenko1967distribution) as the dimensions grow at a fixed ratio. The Marchenko-Pastur distribution is defined as

(7)

Here  and  are the extreme nonzero eigenvalues, and  is a Dirac delta at zero, which disappears if .

In practice, the Marchenko-Pastur distribution is often a remarkably good approximation to the spectral distribution of high-dimensional models, even for data that does not satisfy the i.i.d. assumption; see Figure 2.
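As an illustration of Example 2, the following sketch compares the empirical spectrum of a sample covariance matrix with the Marchenko-Pastur density, assuming the standard parametrization with ratio r = d/n and variance σ² (the names below are ours, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4000, 1000                      # samples and features, ratio r = d/n
X = rng.standard_normal((n, d))
H = X.T @ X / n                        # least-squares Hessian (sample covariance)
eigs = np.linalg.eigvalsh(H)

r, sigma2 = d / n, 1.0
lam_minus = sigma2 * (1 - np.sqrt(r)) ** 2
lam_plus = sigma2 * (1 + np.sqrt(r)) ** 2

def mp_density(lam):
    """Marchenko-Pastur density on [lam_minus, lam_plus]; no point mass since r < 1."""
    out = np.zeros_like(lam)
    inside = (lam > lam_minus) & (lam < lam_plus)
    out[inside] = np.sqrt((lam_plus - lam[inside]) * (lam[inside] - lam_minus)) / (
        2 * np.pi * sigma2 * r * lam[inside])
    return out

hist, edges = np.histogram(eigs, bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - mp_density(centers))))  # rough agreement, up to finite-size noise
```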

Figure 2: Spectral properties of high-dimensional problems can be approximated with simple models. Above, a fit of the Marchenko-Pastur distribution to the spectrum of the Hessian of a least-squares problem (20k features) on the 20news dataset (keerthi2005modified).

The last ingredient before the main result of this section is a simplifying assumption on the initialization that we make throughout the rest of the paper.

Assumption 1

We assume that is independent of and

(8)

This is verified for instance when both  and  are drawn independently from a distribution with scaled identity covariance. We are finally in a position to state the main result of this section, an identity that relates, on one side, the expected error and, on the other side, the initialization error, the residual polynomial, and the expected spectral density.

Theorem 2.1

Let be generated by a first-order method, associated to the polynomial . Then we can decompose the expected error at iteration as

(9)

This identity is remarkable in that it splits the expected error of an algorithm into its three main factors:

  1. The distance to the optimum at initialization enters through the constant , which is the diagonal scaling in Assumption 1.

  2. The optimization method enters the formula through its residual polynomial . The main purpose of the rest of the paper will be to find optimal choices for this polynomial (and its associated method).

  3. The difficulty of the problem class enters through the expected spectral density .
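The following sketch illustrates this decomposition for gradient descent on a fixed Hessian, assuming Assumption 1 takes the form E[(x₀ − x*)(x₀ − x*)ᵀ] = R²·I; in that case the expected squared error equals R² times the sum of P_t(λ_i)² over the eigenvalues, i.e., an integral of P_t² against the empirical spectral density up to normalization. The exact constant in (9) may differ from this toy convention.

```python
import numpy as np

rng = np.random.default_rng(1)
d, R2, t, n_trials = 80, 1.0, 15, 2000
A = rng.standard_normal((d, d))
H = A @ A.T / d
eigs = np.linalg.eigvalsh(H)
h = 1.0 / eigs.max()

# Monte-Carlo estimate of E||x_t - x*||^2 over x0 - x* ~ N(0, R2 * I)
errs = []
for _ in range(n_trials):
    delta = np.sqrt(R2) * rng.standard_normal(d)  # x0 - x*
    x = delta.copy()
    for _ in range(t):
        x = x - h * H @ x                         # gradient descent on 1/2 x^T H x
    errs.append(x @ x)

# Prediction from the three factors: R2, the residual polynomial (1 - h*lam)^t,
# and the spectrum of H (the empirical spectral density, up to the 1/d normalization).
pred = R2 * np.sum((1 - h * eigs) ** (2 * t))
print(np.mean(errs), pred)                        # close, up to Monte-Carlo noise
```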

Remark 3

Although we will not use it in this paper, similar formulas can be derived for the objective and gradient suboptimality:

(10)
(11)

3 Average-case Acceleration

Having introduced the average-case analysis of optimization algorithms, a natural question is: what is the best method in terms of expected error? Our goal in this section is to give a constructive answer to this question in terms of orthogonal polynomials.

Definition 4 (Residual orthogonal polynomial)

We will say that are a sequence of orthogonal residual polynomials with respect to the weight function if they verify for all and

(12)

It is a well-known result (see for instance gautschi1996orthogonal) that orthogonal polynomials follow a three-term recursion of the form

(13)

Due to the number of degrees of freedom, the polynomial  is defined only up to a constant. In our case, we will be interested in residual polynomials (i.e., those that verify ). Residual orthogonal polynomials verify a more specific recursion:

Theorem 5 (Three-term recurrence)

(fischer1996polynomial, §2.4) Any sequence of residual orthogonal polynomials verifies the three-term recurrence

(14)

for some scalars , with and .

Remark 6

Although there are algorithms to compute the coefficients  and  recursively (c.f., fischer1996polynomial), numerical stability and computational costs typically make these methods infeasible for our purposes. In Sections 4–6 we will see how to compute these coefficients for specific distributions of .

We can now state the main result of this section, which gives a constructive algorithm for the method with minimal expected error.

Theorem 3.1

Let be the residual orthogonal polynomials of degree with respect to the weight function , and let be the constants associated with its three-term recurrence. Then the algorithm

(15)

has the smallest expected error over the class of oblivious first-order methods. Moreover, its expected error is:

(16)
Remark 7

The algorithm (15), although optimal over the class of oblivious first-order methods, does not require storing previous gradients. Instead, it has a very convenient momentum-like form and only requires storing two vectors of the iterate's dimension.
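A generic sketch of such a momentum-like iteration is given below. The coefficient callables momentum(t) and step(t) are placeholders standing in for the recurrence constants of (15), which are distribution-specific and not reproduced here; the sketch only illustrates the two-vector, oblivious structure.

```python
import numpy as np

def momentum_method(grad, x0, momentum, step, n_iters):
    """Generic oblivious momentum iteration, storing only two vectors:
        x_{t+1} = x_t + momentum(t) * (x_t - x_{t-1}) - step(t) * grad(x_t)."""
    x_prev, x = x0.copy(), x0.copy()
    for t in range(n_iters):
        x_next = x + momentum(t) * (x - x_prev) - step(t) * grad(x)
        x_prev, x = x, x_next
    return x

# Toy usage on a random strongly convex quadratic, with plain gradient-descent
# coefficients (momentum = 0); the distribution-specific coefficients of (15)
# would be plugged in through `momentum` and `step`.
rng = np.random.default_rng(0)
d = 30
A = rng.standard_normal((3 * d, d))
H = A.T @ A / (3 * d)
b = rng.standard_normal(d)
L = np.linalg.eigvalsh(H).max()
x_hat = momentum_method(lambda x: H @ x - b, np.zeros(d),
                        momentum=lambda t: 0.0, step=lambda t: 1.0 / L,
                        n_iters=500)
print(np.linalg.norm(H @ x_hat - b))  # residual is tiny for this well-conditioned problem
```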

Remark 8

(Relationship with Conjugate Gradient). The derivation of the proposed method bears a strong resemblance to that of the Conjugate Gradient method (hestenes1952methods). One key conceptual difference is that conjugate gradient constructs the optimal polynomial for the empirical spectral density, while we construct a polynomial that is optimal only for the expected spectral density. A more practical advantage is that the proposed method is more amenable to non-quadratic minimization.

The previous Theorem gives a recipe to construct an optimal algorithm from a sequence of residual orthogonal polynomials. However, in practice we may not have access to the expected spectral density , let alone its sequence of residual orthogonal polynomials.

The next sections are devoted to solving this problem through parametric assumptions on the spectral density. We will consider different options for this parametric distribution: exponential (§4), Marchenko-Pastur (§5), and uniform (§6).

4 Optimal method under the exponential distribution

In this section, we assume that the expected spectral density follows an exponential distribution:

(17)

where is the rate parameter of the distribution. This setting is similar to the analysis of algorithms for minimizing convex, non-smooth functions.

By Theorem 3.1, to derive the optimal algorithm it is sufficient to find the sequence of residual orthogonal polynomials with respect to the weight function . Orthogonal (non-residual) polynomials for this weight function have been studied under the name of generalized Laguerre polynomials (abramowitz1972handbook, p. 781). Finding the residual orthogonal polynomials is then just a matter of finding the appropriate normalization, which is given by the following lemma:

Lemma 9

The sequence of scaled Laguerre polynomials

(18)

forms a family of residual orthogonal polynomials with respect to the measure .
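For reference, here is a small numerical check of the orthogonality of generalized Laguerre polynomials under their standard weight x^α e^{−x} (the weight relevant to Lemma 9 is of this exponential form; the particular α and the residual normalization used in (18) are not reproduced here):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_genlaguerre

alpha = 1.0   # generalized Laguerre weight: x**alpha * exp(-x) on [0, infinity)

def inner(n, m):
    integrand = lambda x: (x ** alpha * np.exp(-x)
                           * eval_genlaguerre(n, alpha, x)
                           * eval_genlaguerre(m, alpha, x))
    val, _ = quad(integrand, 0, np.inf)
    return val

print(inner(2, 3))   # ~0: polynomials of distinct degree are orthogonal
print(inner(3, 3))   # > 0: squared norm
```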

This gives the method with best expected error for a problem class whose expected spectral density is a decaying exponential. The resulting algorithm is surprisingly simple:

Accelerated Gradient for Decaying Exponential. Input: Initial guess ,

Algorithm: Iterate over

(EXP)
Remark 10

Under this distribution, and unlike the MP distribution considered in the next section, the largest eigenvalue is not bounded, so this method is more akin to subgradient descent than to gradient descent. Note that, because of this, the step-size is decreasing.

Remark 11

Note the striking similarity with the averaged gradient method of (polyak1992acceleration) (here in the presentation of (flammarion2015averaging, §2.2)):

The difference between (EXP) and the above is where the gradient is computed. In the first case, the method can be seen as an averaged gradient descent where the gradient is evaluated at the averaged point , while in the second the gradient is computed at the last iterate .

Parameter estimation.

The exponential distribution has a free parameter, its rate. Since the expected value of this distribution is the inverse of the rate, we can estimate the rate from the sample mean of the eigenvalues, i.e., the trace of the Hessian divided by the dimension, and fit the parameter as the inverse of this quantity.
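A sketch of this estimation step, assuming only Hessian-vector products are available and using a Hutchinson estimator for the trace; the exponential mean is matched to the average eigenvalue tr(H)/d (the helper names are illustrative):

```python
import numpy as np

def hutchinson_trace(hvp, d, n_samples=50, rng=None):
    """Estimate tr(H) from Hessian-vector products with Rademacher probes."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=d)
        total += z @ hvp(z)
    return total / n_samples

# Toy Hessian; in practice hvp would come from automatic differentiation.
rng = np.random.default_rng(0)
d = 200
A = rng.standard_normal((d, d))
H = A @ A.T / d
hvp = lambda v: H @ v

mean_eig = hutchinson_trace(hvp, d, rng=rng) / d  # sample mean of the eigenvalues
rate = 1.0 / mean_eig                             # exponential with mean 1/rate
print(rate)
```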

4.1 Rate of convergence

Although this will typically not be the case, for this algorithm it is possible to give a simple expression for the expected error. The next result gives the resulting convergence rate.

Lemma 12

If we apply Algorithm (EXP) to problem (OPT), where the spectral density of is the decaying exponential , then

(19)

For convex non-smooth functions, the optimal algorithm achieves a (worst-case) rate of  (see for instance (nesterov2009primal)). We thus achieve acceleration by assuming the function is quadratic with an exponential expected spectral density.

5 Optimal method under the Marchenko-Pastur distribution

In this section, we will derive the optimal method under the Marchenko-Pastur distribution , introduced in Example 2. As in the previous section, the first step is to construct a sequence of residual orthogonal polynomials.

Theorem 5.1

The following sequence of orthogonal residual polynomials:

(20)

with , , is orthogonal with respect to the weight function .

Accelerated Gradient for Marchenko-Pastur. Input: Initial guess , MP parameters . Preprocessing: , . Algorithm (Optimal): Iterate over
(MP-OPT)
Algorithm (Asymptotic variant): Iterate over
(MP-ASYMPT)

We show in Appendix C.2 that these polynomials are shifted Chebyshev polynomials of the second kind. Surprisingly, Chebyshev polynomials of the first kind are the ones used to minimize the worst case. From the optimal polynomial, we can write the optimal algorithm for (OPT) when the spectral density of  is the Marchenko-Pastur distribution. Using Theorem 3.1, we obtain Algorithm MP-OPT.

Figure 3: Residual polynomials for gradient descent, the minimax-optimal polynomial (Chebyshev of the first kind), and the optimal polynomial for the MP distribution, overlaid with an MP distribution in dashed lines. Gradient descent has the largest value at the edges, which is in line with its known suboptimal dependency on the condition number. The MP polynomial is closer to zero than the Chebyshev polynomial except near the borders, where Chebyshev is optimal by construction. Hence the MP polynomial trades off minimax optimality for being smaller over most of the support (and hence in expectation).
More precisely, the worst-case convergence rate (using the maximum value on the interval) is  for Chebyshev and  for MP, so Chebyshev's worst-case rate is approximately 1.7 times smaller. However, the expected rate (using the weighted integral on the interval) is  for MP and  for Chebyshev, making the MP algorithm 1.4 times faster on average.

It is instructive to compare the residual polynomials of different methods to better understand their convergence properties. In Figure 3 we plot the residual polynomials of gradient descent, of its classical worst-case accelerated variant (Chebyshev polynomials), and of the average-case accelerated variant (under the MP distribution) described above.
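A small numerical illustration of this trade-off compares the maximum of the squared residual polynomial over the support (the worst-case factor) with its average under an MP-like density (the average-case factor), for gradient descent and for the shifted Chebyshev polynomial of the first kind. The MP-optimal polynomial itself is omitted since its recurrence is distribution-specific, and the toy support below is arbitrary, not the setting quoted in the caption.

```python
import numpy as np
from scipy.special import eval_chebyt

# Toy support of an MP-like spectral density (strongly convex case, lam_minus > 0)
lam_minus, lam_plus, t = 0.1, 4.0, 10
lam = np.linspace(lam_minus, lam_plus, 2000)
dx = lam[1] - lam[0]

# Density proportional to sqrt((lam_plus - x)(x - lam_minus)) / x, normalized numerically
mp = np.sqrt((lam_plus - lam) * (lam - lam_minus)) / lam
mp /= mp.sum() * dx

# Residual polynomials: gradient descent and the shifted Chebyshev of the first kind
h = 2.0 / (lam_minus + lam_plus)
grad_poly = (1 - h * lam) ** t
z = (lam_plus + lam_minus - 2 * lam) / (lam_plus - lam_minus)
cheb_poly = eval_chebyt(t, z) / eval_chebyt(t, (lam_plus + lam_minus) / (lam_plus - lam_minus))

for name, p in [("gradient descent", grad_poly), ("Chebyshev", cheb_poly)]:
    worst = np.max(p ** 2)              # worst-case factor on the support
    avg = np.sum(p ** 2 * mp) * dx      # average-case factor under the density
    print(name, worst, avg)
```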

5.1 Asymptotic behavior

We also study the asymptotic behavior of this algorithm as . It suffices to solve

The sequence converges to when , otherwise it converges to . Replacing the value in Algorithm MP-OPT by its asymptotic value gives the following simplified accelerated gradient method.

Algorithm MP-OPT corresponds to gradient descent with variable momentum and step-size terms, which all converge to constants, yielding the simpler Algorithm MP-ASYMPT. However, even if we assume that the spectral density of the Hessian is the Marchenko-Pastur distribution, we still need to estimate the hyper-parameters  and . The next section gives a way to estimate those hyper-parameters cheaply.

5.2 Hyper-parameter estimation

Algorithm MP-OPT and Algorithm MP-ASYMPT require knowledge of the first two moments of the MP distribution,  and . It is important to enforce the constraint that the largest eigenvalue lies inside the support of the MP distribution; this leads to a constrained estimation problem. With the notation , the solution is given by .

6 Optimal method under the uniform distribution

We now focus on the (unnormalized) uniform distribution over :

We show in Appendix C.3 that a sequence of residual orthogonal polynomials with respect to this density is a sequence of shifted Legendre polynomials. Legendre polynomials are orthogonal w.r.t. the uniform distribution on  and are defined as

We detail in Appendix C.3 the necessary translation and normalization steps to obtain the sequence of residual orthogonal polynomials.
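As a quick check of the shift, composing Legendre polynomials with the affine map sending [ℓ, L] to [−1, 1] yields polynomials orthogonal w.r.t. the uniform density on [ℓ, L]; the residual normalization of Appendix C.3 is omitted in this sketch, and the interval endpoints below are arbitrary:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_legendre

l, L = 0.5, 3.0   # endpoints of the uniform spectral density (arbitrary here)

def shifted_legendre(n, x):
    """Legendre polynomial composed with the affine map sending [l, L] to [-1, 1]."""
    return eval_legendre(n, 2.0 * (x - l) / (L - l) - 1.0)

def inner(n, m):
    val, _ = quad(lambda x: shifted_legendre(n, x) * shifted_legendre(m, x), l, L)
    return val

print(inner(2, 4))   # ~0: orthogonal w.r.t. the uniform weight on [l, L]
print(inner(3, 3))   # > 0: squared norm
```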

Accelerated Gradient for Uniform Distribution. Input: Initial guess , and .

Init: .

Algorithm: Iterate over

(UNIF)

Like the Marchenko-Pastur accelerated gradient method, the parameters  and  can be estimated through the moments of the uniform distribution.

7 Experiments

We compare the proposed methods and classical accelerated methods in settings with varying degrees of mismatch with our assumptions. We first compare them on quadratics generated from a synthetic dataset, where the empirical spectral density is (approximately) a Marchenko-Pastur distribution. We then compare these methods on another quadratic problem, this time generated using two non-synthetic datasets, where the MP assumption breaks down. Finally, we compare these methods on a computer vision problem. We will see that the proposed methods perform reasonably well in this scenario, despite being far from their original quadratic deterministic setting. A full description of datasets and methods can be found in Appendix D.

Synthetic quadratics. We consider the least-squares problem with objective function , where the entries of  and  are generated i.i.d. from a Gaussian distribution. Using different ratios we generate problems that are convex () and strongly convex (). For non-strongly convex problems, since the parameters of the Chebyshev method are no longer well defined, we switch to the momentum method of (ghadimi2015global), denoted “Modified Chebyshev”.

Non-Synthetic quadratics. The last two columns of Fig. 4 compare the same methods, but this time on two real (non-synthetic) datasets. We use two UCI datasets: digits () and breast cancer ().

Logistic regression. Using synthetic (i.i.d.) data, we compare the proposed methods on a logistic regression problem. We generate two datasets, the first with  and the second with . Results are shown in the first two columns of Figure 5.

Stochastic convolutional neural network. We adapt the proposed (MP-ASYMPT) algorithm into a stochastic optimization algorithm by replacing the true gradient with a stochastic estimate and adding a decreasing step-size (Stochastic MP). We compare it against SGD with momentum on the task of fitting a convolutional neural network (two convolutional layers + max-pool + fully connected layer) on the CIFAR10 dataset. Results are shown in the last two columns of Figure 5.

Since in this case computing the largest eigenvalue is expensive, and the decreasing step-size removes the issue of eigenvalues lying outside the support, we estimate both distribution parameters from the empirical moments. Let ,  be the first two empirical moments, approximated using the Hutchinson trace estimator (hutchinson1990stochastic). Matching the first two moments results in , .
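A sketch of this moment-matching step, assuming the standard Marchenko-Pastur moments E[λ] = σ² and E[λ²] = σ⁴(1 + r); this parametrization is illustrative and may differ from the paper's, and the helper names are ours:

```python
import numpy as np

def hutchinson_moment(hvp, d, power, n_samples=50, rng=None):
    """Estimate (1/d) tr(H^power) with Rademacher probes and repeated HVPs."""
    rng = rng or np.random.default_rng(0)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=d)
        v = z
        for _ in range(power):
            v = hvp(v)
        total += z @ v
    return total / (n_samples * d)

rng = np.random.default_rng(0)
n, d = 3000, 1000
X = rng.standard_normal((n, d))
hvp = lambda v: X.T @ (X @ v) / n        # HVP of the least-squares Hessian X^T X / n

m1 = hutchinson_moment(hvp, d, power=1, rng=rng)
m2 = hutchinson_moment(hvp, d, power=2, rng=rng)
sigma2_hat = m1                # matches E[lambda]   = sigma^2
r_hat = m2 / m1 ** 2 - 1       # matches E[lambda^2] = sigma^4 * (1 + r)
print(sigma2_hat, r_hat, d / n)  # r_hat should be close to the true ratio d/n
```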

Figure 4: Benchmark on quadratic objectives. The top row shows the Hessian's spectral density overlaid with the MP distribution, while the bottom row shows the objective suboptimality as a function of the number of iterations. The first and second columns show, respectively, a strongly convex and a convex synthetic problem with MP spectral distribution. In all plots, the (Accelerated Gradient for) Marchenko-Pastur algorithm performs better than the classical worst-case optimal method (Modified Chebyshev), and is close in performance to Conjugate Gradient, which is optimal for the empirical (instead of expected) spectral density. Note from the first plot that the Marchenko-Pastur method does not necessarily switch to sublinear convergence like Chebyshev in the presence of zero eigenvalues.
Figure 5: Benchmark on logistic regression and a deep learning task. The first two columns show a logistic regression problem on synthetic data, demonstrating that the proposed methods are applicable (and even competitive) outside the realm of quadratic optimization. The last two columns show training loss and test accuracy on a convolutional neural network task, showing that the MP method can be extended to the stochastic and non-convex setting.

8 Conclusion and Future Work

In this work, we first developed an average-case analysis of optimization algorithms, and then used it to develop a family of novel algorithms that are optimal under this analysis. We outline potential directions for future work.

Mixture of densities to model outlier eigenvalues. The MP distribution often fails to accurately model the outlier eigenvalues that arise in real data (see e.g. the last columns of Fig. 4). A potential solution would be to consider mixtures, where one density models the bulk and another models the outlier eigenvalues.

Non-quadratic and stochastic extension. As seen in Fig. 5, the methods are applicable and perform well beyond the quadratic setting in which they were conceived. We currently lack the theory to explain this.

Asymptotic behavior. Some of the proposed algorithms converge asymptotically towards Polyak momentum (§5.1). The generality of this phenomenon, and its implications in terms of average-case optimality for Polyak momentum, remain to be understood.

Acknowledgements

We would like to thank our colleague Adrien Taylor for inspiring early discussions, and Nicolas Le Roux, Courtney Paquette, Geoffrey Negiar and Nicolas Loizou for fruitful discussions and feedback on the manuscript.

References

Appendix A Proofs of Sections 2 and 3

We start by recalling the definition of expectation of a random measure:

Definition 13 (tao2012topics)

Given a random measure , its expected measure is the measure that satisfies

(21)

for any continuous with compact support on .

Theorem 2.1 (restated).

Proof  This proof mostly follows from the identity of Proposition 2.1 and the definition of the expected spectral density:

(22)
(23)
(24)

 

Theorem 3.1 (restated).

Before we go into the proof, we state the following auxiliary result

Lemma 14

(fischer1996polynomial) The residual polynomial of degree that minimizes is given by the degree residual orthogonal polynomial with respect to the weight function .

Proof 

Let  be the residual orthogonal polynomial with respect to the weight function . In light of the previous lemma, this corresponds to the method with minimal expected error. We will now prove that it corresponds to the algorithm (15).

Removing from both sides of Eq. (15) and using we have

(25)

However, we have that and are polynomials, since

Using this argument recursively, we have that

(26)

Since, by construction,  is the polynomial that minimizes the expected error, the algorithm is optimal. We now show that the square in the integral can be simplified. Without any loss of generality we consider . Indeed,

(27)

Let . This is a polynomial of degree  since we took , removed the independent term, and divided by . Since the sequence  forms a basis of polynomials, we have that  is orthogonal to all polynomials of degree less than  w.r.t. . In this case,  is orthogonal to , thus

(28)

where the last equality follows from the fact that is a normalized polynomial, thus .  

Appendix B Manipulations Over Polynomials

This section summarizes techniques used to transform the recurrence of orthogonal polynomials. We list these techniques here; they are detailed in the subsections below.

In Appendix B.1, Theorem B.1, we show how to transform the recurrence of a polynomial , orthogonal w.r.t. , into that of , orthogonal w.r.t. . It is a common situation to have an explicit recurrence of orthogonal polynomials for the density , but not for .

However, Theorem B.1 requires that the polynomials in the initial sequence are monic, which means that the coefficient associated with the largest power is one. Unfortunately, most explicit recurrences are not monic. In Appendix B.2, more precisely in Proposition B.1, we show a simple transformation of the recurrence of an orthogonal polynomial to make it monic.

Finally, to build an algorithm, we need residual polynomials, i.e., polynomials such that . In Appendix B.3, Proposition B.2, we give a technique that normalizes the sequence to create residual polynomials.

A typical application is shown in Section 6. We start with the sequence of polynomials orthogonal w.r.t. the uniform distribution, apply Proposition B.1 to make the polynomials monic, then Theorem B.1, and finally Proposition B.2 to deduce the optimal algorithm.

Appendix B.1 Computing the orthogonal polynomial w.r.t from using kernel polynomials

Sometimes, we may have an explicit expression for a sequence of polynomials orthogonal w.r.t. the weight function , but not w.r.t. . In this case, we have to transform the sequence into another one, , orthogonal w.r.t. . The following theorem addresses this situation in a more general setting.

Theorem Appendix B.1

(gautschi1996orthogonal, Thm. 7) Assume the sequence of polynomials  is orthogonal w.r.t. the weight function . Then, the sequence of polynomials defined as

is orthogonal w.r.t the weight function if . This condition is automatically satisfied if is outside the support of .

Moreover, if the recurrence for the ’s reads

(i.e., is monic), then the recurrence for ’s reads

and is well defined if .

This theorem allows us to deduce the sequence of polynomials orthogonal w.r.t.  from the knowledge of a sequence of orthogonal polynomials w.r.t. . Using this theorem recursively can be particularly useful. For example, if the sequence for the weight function  is known, then Theorem B.1 allows us to deduce the optimal polynomial for the quantities (9), (10) and (11), since these require orthogonal polynomials w.r.t.  for .

Appendix B.2 Monic orthogonal polynomials

Theorem B.1 requires the sequence of polynomials to be monic, which means that the coefficient associated with the largest power is equal to one. In the recursion, this means the coefficient  should be equal to one. The following proposition shows how to normalize the recurrence to obtain monic polynomials.

Proposition Appendix B.1

Let be defined as

Then, we can transform into its monic version using the recurrence

(29)
(30)

Proof  We start with the definition of ,

Let . Then,

In order to have , we need , thus .  
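A minimal illustration of this transformation using Legendre polynomials and NumPy's polynomial utilities; it normalizes by the leading coefficient after the fact rather than through the recurrence (29)–(30):

```python
import numpy as np
from numpy.polynomial import legendre

def monic_legendre(n):
    """Coefficients (lowest degree first) of the monic version of the Legendre P_n."""
    coeffs = legendre.leg2poly([0.0] * n + [1.0])  # P_n in the monomial basis
    return coeffs / coeffs[-1]                     # divide by the leading coefficient

print(monic_legendre(3))  # [0, -0.6, 0, 1], i.e. x^3 - (3/5) x
```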

Appendix B.3 Transformation into residual orthogonal polynomials

Building the optimal algorithm requires the polynomial to be normalized. The next proposition shows how to transform a sequence of orthogonal polynomials into a sequence of normalized polynomials .

Proposition Appendix B.2

Assume we can generate a sequence of orthogonal polynomials using coefficients . Then, if all , we can normalize the sequence into

where

Proof  Let . Then,

Let . Since we need

This gives the recurrence

It remains to compute the initial value, . However,

This means since we need .  

Using an additional sequence , we can easily transform any sequence of orthogonal polynomials into its normalized version. Now, we show how to design an optimization algorithm for quadratics that builds the optimal polynomial.

Appendix C Optimal Polynomials

Appendix C.1 Optimal polynomials for the exponential distribution

Lemma 9 (restated).

Proof  We start with the definition of generalized Laguerre polynomials,

which are orthogonal w.r.t. to the weight function

In our case, we aim to find the sequence of orthogonal polynomials w.r.t. the weight function , so we fix . To make the notation lighter, we now drop the superscript. The polynomials of this sequence are not residual, thus we have to apply Proposition B.2. From this proposition, we have that

It is possible to show that , thus

According to Proposition B.2, using

in the sequence of residual orthogonal polynomials gives residual polynomials which are orthogonal w.r.t. the weight function . By Theorem 3.1, this leads to the optimal polynomial.  
Lemma 12 (restated).

Proof  We assume for simplicity. First, we use the useful summation property of Laguerre polynomials from (abramowitz1972handbook, equ. (22.12.6)),

In particular,

Thus, the following integral simplifies into

where the last equality follows from the orthogonality of Laguerre polynomials (see Theorem 9). Moreover, it can be shown that

This means that

However,  is the non-normalized version of the Laguerre polynomial, thus it needs to be divided by . It can easily be shown that , thus

The last step consists in evaluating the bound in Theorem 9 with , which gives

(31)

 

Appendix C.2 Optimal polynomials for the Marchenko-Pastur distribution

Theorem 5.1 (restated).

Proof  Let be the -th Chebyshev polynomial of the second kind. By definition, these polynomials are orthogonal w.r.t the semi-circle law , i.e.,

Now, consider the parametrization