Existing analysis of optimization algorithms focuses on a worst-case analysis (nemirovski1995information; nesterov2004introductory). Given any input from a function class, no matter how unlikely, this type of analysis provides a complexity bound. This can lead to misleading results where the worst-case running time is much worse than the observed running time in practice.
Average-case analysis provides instead the expected
complexity of an algorithm over a class of problems, and is more representative of the typical behavior of the algorithm. Contrary to the more classical worst-case analysis, which only requires knowledge of the largest and smallest eigenvalue of the Hessian, the average-case analysis requires a more fine-grained knowledge of the spectrum. In particular, it requires knowledge of the probability density function of a random eigenvalue, also known as theexpected spectral density.
Not much attention has been given to the average-case analysis of optimization algorithms due to the dependency on the Hessian spectrum. However, a blessing of the high-dimensional regime is that the spectrum becomes surprisingly predictable. For random matrices, this has been well known since the seminal work of 10.2307/1970079
, who used it to model nuclei of heavy atoms. Recent work has shown that the spectrum of large deep learning models is equally predictable and that it can be approximated using classical models from random matrix theory like the Marchenko-Pastur distribution(sagun2017empirical; martin2018implicit)
. As the parameters of this distribution can be estimated trough the empirical moments (which can be estimated from traces of powers of the Hessian) it opens the door to the development of new algorithms with better average-case convergence.
In this work we make the following main contributions:
We derive a framework to construct the average-case optimal gradient-based method.
We propose a family of practical algorithms using different models of the expected spectral density: Marchenko-Pastur, uniform and exponential.
Experiments showing the performance of this method on synthetic and real datasets. These show that the method is highly competitive with respect to traditional accelerated methods, and unlike the latter, does not require knowledge of the strong convexity constant.
1.1 Related work
We comment on two of the main ideas behind the proposed methods: polynomial-based iterative methods and spectral density estimation.
Polynomial-based iterative methods.
Our work draws heavily from the classical framework of polynomial-based iterative methods (fischer1996polynomial) that originated with the Chebyshev iterative method of (flanders1950numerical) and was later instrumental in the development of the celebrated conjugate gradient method (hestenes1952methods). Recently, this framework has been used to derive accelerated gossip algorithms (berthier2018accelerated) and accelerated algorithms for smooth games (azizian2020accelerating), to name a few. Although originally derived for the worst-case analysis of optimization algorithms, in this work we extend this framework to analyze also the average-case runtime.
Spectral density estimation.
The realization that deep learning networks behave as linear models in the infinite-width limit (jacot2018neural; novak2018bayesian) has sparked a renewed interest in the study of spectral density of large matrices. ghorbani2019investigation developed efficient tools for estimating the empirical spectrum of very large models and jacot2019asymptotic; pennington2017nonlinear derive new models for the spectral density function of neural networks, to name a few.
Throughout the paper we denote vectors in lowercase boldface () and matrices in uppercase boldface letters (). Probability density functions and eigenvalues are written in greek letters (), while polynomials are written in uppercase latin letter (). We will often omit integration variable, with the understanding that is a shorthand for
2 Average-case Analysis
In this section we derive the necessary framework to talk about average-case optimal algorithms. The main result of this section is Theorem 2.1 that relates the expected error with the other quantities that we will find easier to manipulate, like the (to be defined) residual polynomial. This will allow us in the next section to pose the problem of finding an optimal method as a best approximation problem in the space of polynomials.
Let be a symmetric positive-definite matrix and , both sampled from a probability space . We consider the quadratic minimization problem
and will be interested in quantifying the expected error , where is the -th update of a first-order method starting from and is the expectation over the random .
Problem (OPT) subsumes the quadratic minimization problem but the notation above will be more convenient for our purposes.
The expected error is over the inputs and not over any randomness of the algorithm, as is common in the stochastic literature. In this paper we will only consider deterministic algorithms.
To solve (OPT), we will consider first-order methods. These are methods in which the sequence of iterates is in the span of previous gradients, i.e.,
This class of algorithms includes gradient descent and momentum, but not quasi-Newton methods for example, since the preconditioner would allow the iterates to go outside of the span. Furthermore, we will only consider oblivious methods, that is, methods in which the coefficients of the update are known in advance and don’t depend on previous updates. This leaves out some methods that don’t generalize beyond quadratics like conjugate gradient.
There is an intimate link between first order methods and polynomials that simplifies the analysis of gradient-based methods. The following proposition showcases this relation by relating the error at iteration with the error at initialization and the residual polynomial.
(hestenes1952methods) Let be generated by a first-order method. Then there exists a polynomial of degree such that that verifies
Following (fischer1996polynomial), we will refer to this polynomial as the residual polynomial.
(Gradient descent). In the case of gradient descent, the residual polynomial has a remarkably simple form. Consider the update . Subtracting on both sides gives
and so the residual polynomial is .
A convenient way to collect statistics on the problem is through the empirical spectral density. Let be the eigenvalues of . We define the empirical spectral density as
where denote a Dirac delta, i.e., the function equal to zero everywhere except at and whose integral over the entire real line is equal to one.
We will now need a tool to collect information about the typical spectrum for matrices in our probability space. For this, we will use the expected spectral density, which corresponds to the law of a random eigenvalue from a random matrix . For notational convenience, we will denote it :
Example 2 (Marchenko-Pastur distribution)
Consider a matrix . Then it is known that the expected spectral distribution of converges to to the Marchenko-Pastur distribution (marchenko1967distribution) as and at a rate in which . The Marchenko-Pastur distribution is defined as
here , are the extreme nonzero eigenvalues, is a Dirac delta at zero, which disappears if .
In practice, the Marchenko-Pastur distribution it is often a remarkably good approximation to the spectral distribution of high-dimensional models, even for data that might not verify the i.i.d. assumption, see Figure 2.
The last ingredient before the main result of this section is a simplifying assumption on the initialization that we make throughout the rest of the paper.
We assume that is independent of and
This is verified for instance when both and are drawn independently from a distribution with scaled identity covariance. We are finally in a position to state the main result of this section, an identity that relates on one side the expected error and on the other wide the initialization error, residual polynomial and expected spectral density.
Let be generated by a first-order method, associated to the polynomial . Then we can decompose the expected error at iteration as
This identity is remarkable in that it splits the expected error of an algorithm into its three main factors:
The distance to optimum at initialization enters through the constant , which is the diagonal scaling in Assumption 1.
The optimization method enters in the formula through its residual polynomial . The main purpose of the rest of the paper will be to find optimal choices for this polynomial (and its associated method).
The difficulty of the problem class enters through the expected spectral density .
Although we will not use it in this paper, similar formulas can be derived for the objective and gradient suboptimality:
3 Average-case Acceleration
Once we have introduced the average-case analysis of optimization algorithms, a natural question to ask is: what is the best method in terms of expected error? Our goal in this section is to give a constructive answer to this question in terms of orthogonal polynomials.
Definition 4 (Residual orthogonal polynomial)
We will say that are a sequence of orthogonal residual polynomials with respect to the weight function if they verify for all and
It is well a well known result (see for instance gautschi1996orthogonal) that orthogonal polynomial follow a three-term recursion of the form
Due to the number of degree of freedom, the polynomialcan be defined up to a constant. In our case, we will be interested in residual polynomials (i.e., those that verify ). Residual orthogonal polynomials verify a more specific recursion:
Theorem 5 (Three-term recurrence)
(fischer1996polynomial, §2.4) Any sequence of residual orthogonal polynomials verifies the three-term recurrence
for some scalars , with and .
Although there are algorithms to compute the coefficients and recursively (c.f., fischer1996polynomial), numerical stability and computational costs typically make these methods unfeasible for our purposes. In chapters 4–6 we will see how to compute these coefficients for specific distributions of .
We can now state the main result of this section, which gives a constructive algorithm for the method with minimal expected error.
Let be the residual orthogonal polynomials of degree with respect to the weight function , and let be the constants associated with its three-term recurrence. Then the algorithm
has the smallest expected error over the class of oblivious first-order methods. Moreover, its expected error is:
The algorithm (15), although being optimal over the space of all first-order methods, does not require storage of previous gradients. Instead, it has a very convenient momentum-like form and only requires to store two -dimensional vectors.
(Relationship with Conjugate Gradient). The derivation of the proposed method bears a strong resemblance with the Conjugate Gradient method (hestenes1952methods). One key conceptual difference is that the conjugate gradient constructs the optimal polynomial for the empirical spectral density, while we construct a polynomial that is optimal only for the expected spectral density. A more practical advantage is that the proposed method is more amenable to non-quadratic minimization.
The previous Theorem gives a recipe to construct an optimal algorithm from a sequence of residual orthogonal polynomials. However, in practice we may not have access to the expected spectral density , let alone its sequence of residual orthogonal polynomials.
4 Optimal method under the exponential distribution
In this section, we assume that the expected spectral density follows an exponential distribution:
where is the rate parameter of the distribution. This setting is similar to the analysis of algorithms for minimizing convex, non-smooth functions.
Because of Theorem 3.1, to derive the optimal algorithm, its sufficient to find the sequence of residual orthogonal polynomials with respect to the weight function . Orthogonal (non-residual) polynomials for this weight function have been studied under the name of generalized Laguerre polynomials (abramowitz1972handbook, p. 781). Finding the residual orthogonal polynomials is then just a matter of finding the appropriate normalization, which is given by the following lemma:
The sequence of scaled Laguerre polynomials
are a family of residual orthogonal polynomials with respect to the measure .
This gives the method with best expected error for a problem class whose expected spectral density is a decaying exponential. The resulting algorithm is surprisingly simple:
[logo= ]Accelerated Gradient for Decaying Exponential Input: Initial guess ,
Algorithm: Iterate over
In this distribution, and unlike in the MP that we will see in the next section, the largest eigenvalue is not bounded so this method is more akin to subgradient descent and not gradient descent. Note that because of this, the step-size is decreasing.
Note the striking similarity with the averaged gradient method of (polyak1992acceleration) (here in the presentation of (flammarion2015averaging, §2.2)):
The difference between (EXP) and the previous is where the gradient is computed. In this first case, the method can be seen as an averaged gradient decent, where the gradient is evaluated at the averaged point , while in the second the gradient is computed at the last iterate .
The exponential distribution has a free parameter . Since the expected value of this distribution is , we can estimate the parameter from the sample mean, which is . Hence, we fit this parameter as .
4.1 Rate of convergence
Although this will typically not be the case, for this algorithm it will be possible to give a simple expression for the expected error. The next theorem shows that it converges at rate .
In the case of the convex non-smooth functions, optimal algorithm achieves a (worst-case) rate which is (see for instance (nesterov2009primal)). We thus achieve acceleration, by assuming the function quadratic with the exponential as average spectral density.
5 Optimal method under the Marchenko-Pastur distribution
In this section, we will derive the optimal method under the Marchenko-Pastur distribution , introduced in Example 2. As in the previous section, the first step is to construct a sequence of residual orthogonal polynomials.
The following sequence of orthogonal residual polynomials:
with , , is orthogonal with respect to the weight function .
We show in Appendix Appendix C.2 that these polynomials are shifted Chebyshev polynomials of the second kind. Surprisingly, the Chebyshev of the first kind are used to minimize over the worst case. From the optimal polynomial, we can write the optimal algorithm for (OPT) when the spectral density function of is the Marchenko-Pastur distribution. By using Theorem 3.1, we obtain Algorithm MP-OPT.
It’s instructive to compare the residual polynomials of different methods to better understand their convergence properties. In Figure 3 we plot the residual polynomials for gradient descent, it’s classical worst-case analysis accelerated variant (Chebyshev polynomials) and the average-case analysis accelerated (under the MP distribution) variant above.
5.1 Asymptotic behavior
We also study the asymptotic behavior of this algorithm as . It suffices to solve
The sequence converges to when , otherwise it converges to . Replacing the value in Algorithm MP-OPT by its asymptotic value gives the following simplified accelerated gradient method.
Algorithm MP-OPT corresponds to a gradient descent with a variable momentum and step-size terms, all converging to a simpler one (Algorithm MP-ASYMPT). However, even if we assume that the spectral density of the Hessian is the Marchenko Pastur distribution, we still need to estimate the hyper-parameters and . The next section gives a way to estimate cheaply those hyper-parameters.
5.2 Hyper-parameters estimation
Algorithm MP-OPT and Algorithm MP-ASYMPT require the knowledge of the first two moments of the MP distribution and . It is important to enforce the largest eigenvalue to lie inside the support of the distribution, as otherwise the constraint that the largest eigenvalue is inside the support of the MP ,. With the notation , the solution is given by .
6 Optimal method under the uniform distribution
We now focus on the (unnormalized) uniform distribution over:
We show in Appendix C.3 that a sequence of orthogonal residual polynomials with respect to this density is a sequence of shifted Legendre polynomials. Legendre polynomial are orthogonal w.r.t. the uniform distribution in and are defined as
We detail in Appendix C.3 the necessary translation and normalization steps to obtain the sequence of residual orthogonal polynomials.
[logo= ]Accelerated Gradient for Uniform Distribution Input: Initial guess , and .
Algorithm: Iterate over
Like the Marchenko-Pastur accelerated gradient, the parameters and can be estimated trough the moment of the uniform distribution.
We compare the proposed methods and classical accelerated methods on settings with varying degree of mismatch with our assumptions. We first compare them on quadratics generated from a synthetic dataset, where the empirical spectral density is (approximately) a Marchenko-Pastur distribution. We then compare these methods on another quadratic problem, but this time generated using two non-synthetic datasets, where the MP assumption breaks down. Finally, we compare these methods on a computer vision problem. We will see that these methods perform reasonably well in this scenario, although being far from their original quadratic deterministic setting. A full description of datasets and methods can be found inAppendix D.
Synthetic quadratics. We consider the least squares problem with objective function , where each entry of and
are generated from an i.i.d. random Gaussian distribution. Using different ratios we generate problems that are convex () and strongly convex (). For non-strongly convex problems, since the parameters of the Chebyshev method are no longer well defined, we switch to the momentum method of (ghadimi2015global). This is denoted “Modified” Chebyshev.
Logistic regression. Using synthetic (iid) data, we compare the proposed methods on a logistic regression problem. We generate two datasets, first one with and second one with . Results shown in the first two columns of Figure 5.
Stochastic convolutional neural network. We adapt the proposed (MP-ASYMPT
) algorithm into a stochastic optimization algorithm by replacing the true gradient with a stochastic estimate and adding a decreasing step-size (Stochastic MP). We compare it against SGD with momentum on the task of fitting a convolutional neural network (2 convolutional layers + Max-pool + FC layer) CIFAR10 dataset using a convolutional neural network. Results shown in the last two columns of Figure5.
Since in this case computing the largest eigenvalue is expensive and we no longer have the issue of eigenvalues outside of the support due to the decreasing step-size, we estimate both distribution parameters from the empirical moments. Let , be the first two empirical moments, approximated using the Hutchinson trace estimator (hutchinson1990stochastic). Matching the first two moments results in , .
8 Conclusion and Future Work
In this work, we first developed an average-case analysis of optimization algorithms, and then used it to develop a family of novel algorithms that are optimal under this average-case analysis. We outline potential future work directions.
Mixture of densities to model outlier eigenvalues
. Often the MP distribution fails to model accurately the outlier eigenvalues that arise in real data (see e.g. last row Fig.4). A potential solution would be to consider mixtures, where one density is modeling the bulk and another one is modeling outlier eigenvalues.
Non-quadratic and stochastic extension. As seen in Fig. 5, the methods are applicable and perform well beyond the quadratic setting in which they were conceived. Currently we currently lack the theory to explain this.
Asymptotic behavior. Some of the proposed algorithms converge asymptotically towards to Polyak momentum (§5.1). It is still to be realized the generality of this phenomenon and its implications in terms of average-case optimality for Polyak momentum.
We would like to thank our colleagues Adrien Taylor for inspiring early discussions and Nicolas Le Roux, Courtney Paquette, Geoffrey Negiar and Nicolas Loizou for fruitful discussion and feedback on the manuscript.
We start by recalling the definition of expectation of a random measure:
Definition 13 (tao2012topics)
Given a random measure , its expected measure is the measure that satisfies
for any continuous with compact support on .
Proof This proof mostly follows from the identity of Proposition 2.1 and the definitions of expected spectral density
Before we go into the proof, we state the following auxiliary result
(fischer1996polynomial) The residual polynomial of degree that minimizes is given by the degree residual orthogonal polynomial with respect to the weight function .
Let be the residual orthogonal polynomial with respect to the weight function . In light of the previous lemma, this corresponds to the method with minimal expected error. We will now prove that it corresponds to the algorithm (15)
Removing from both sides of Eq. (15) and using we have
However, we have that and are polynomials, since
Using this argument recursively, we have that
Since, by construction, is the polynomial that minimizes the expectee error, the algorithm is optimal. We now show that the square in the integral can be simplified. Without any loss of generality we consider . Indeed,
Let . This is a polynomial of degree since we took , then we removed the independent term, then divided by . Since the sequence forms a basis of polynomial, we have that is orthogonal to all polynomials of degree less than w.r.t. . In this case, is orthogonal to , thus
where the last equality follows from the fact that is a normalized polynomial, thus .
Appendix Appendix B Manipulation Over Polynomials
This section summarize techniques used to transform the recurrence of orthogonal polynomials. We list these techniques here, but they are detailed in Appendix Appendix B.
In Appendix Appendix B.1, Theorem Appendix B.1, we show how to transform the recurrence of a polynomial , orthogonal w.r.t. , into , orthogonal to . It is a common situation where we have an explicit recurrence of orthogonal polynomials for the density , but not for .
However, Theorem Appendix B.1 requires that the polynomials in the initial sequence are monic, which means that the coefficient associated to the largest power is one. Unfortunately, most of explicit recurrences are not monic. In Appendix Appendix B.2, more precisely in Proposition Appendix B.1, we show a simple transformation of the recurrence of an orthogonal polynomial to make it monic.
Finally, to build an algorithm, we need to have residual polynomials, i.e., polynomials such that . In Appendix Appendix B.3, Proposition Appendix B.2, we give a technique that normalize the sequence to create residual polynomials.
An typical application is shown in Section 6. We start with the sequence of orthogonal polynomials orthogonal w.r.t. the uniform distribution, we apply Proposition Appendix B.1 to make the polynomials monic, then Theorem Appendix B.1 and finally Proposition Appendix B.2 to finally deduce the optimal algorithm.
Appendix B.1 Computing the orthogonal polynomial w.r.t from using kernel polynomials
Sometimes, it may happen that we have an explicit expression of a sequence of polynomial orthogonal w.r.t. the weight function , but not to . In this case, we have to transform the sequence into another one, , orthogonal w.r.t. . The following theorem addresses this situation, but in a more general setting.
Theorem Appendix B.1
(gautschi1996orthogonal, Thm. 7) Assume the sequence of polynomials is orthogonal w.r.t. the weight function . Then, the sequence of polynomial defined as
is orthogonal w.r.t the weight function if . This condition is automatically satisfied if is outside the support of .
Moreover, if the recurrence for the ’s reads
(i.e., is monic), then the recurrence for ’s reads
and is well defined if .
This theorem allows us to deduce the sequence of orthogonal polynomials from the knowledge of a sequence of orthogonal polynomials w.r.t . Using this theorem recursively can be particularly useful. For example, if the sequence for the weight function is known, then Theorem Appendix B.1 allows us to deduce the optimal polynomial for the quantities (9), (10) and (11), since it requires orthogonal polynomials w.r.t for .
Appendix B.2 Monic orthogonal polynomials
Theorem Appendix B.1 requires the sequence of polynomials to be monic, which means that the coefficient associated to the largest power is equal to one. In the recursion, this means the coefficient should be equal to one. The following proposition shows how to normalize the recurrence to end with monic polynomials.
Proposition Appendix B.1
Let be defined as
Then, we can transform into its monic version using the recurrence
Proof We start with the definition of ,
Let . Then,
In order to have , we need , thus .
Appendix B.3 Transformation into residual orthogonal polynomials
Building the optimal algorithm requires the polynomial to be normalized. The next proposition shows how to transform a sequence of orthogonal polynomials into a sequence of normalized polynomials .
Proposition Appendix B.2
Assume we can generate a sequence of orthogonal polynomials using coefficients . Then, if all , we can normalize the sequence into
Proof Let . Then,
Let . Since we need
This gives the recurrence
It remains to compute the initial value, . However,
This means since we need .
Using an additional sequence , we can transform easily any sequence of orthogonal polynomial into its normalized version. Now, we show how to design an optimization algorithm for quadratic that builds the optimal polynomial.
Appendix Appendix C Optimal Polynomials
Appendix C.1 Optimal polynomials for the exponential distribution
Proof We start with the definition of generalized Laguerre polynomials,
which are orthogonal w.r.t. to the weight function
In our case, we aim to find the sequence of orthogonal polynomial w.r.t. the weight function , so we fix . To make the notation lighter, we now remove the superscript. The polynomials of this sequence are nor residual, thus we have to apply Proposition (Appendix B.2). From this proposition, we have that
It is possible to show that , thus
According to Proposition (Appendix B.2), using
Proof We assume for simplicity. First, we use the useful summation property of Laguerre polynomials from (abramowitz1972handbook, equ. (22.12.6)),
Thus, the following integral simplifies into
where the last equality follows from the orthogonality of Laguerre polynomials (see Theorem 9). Moreover, it can be shown that
This means that
However, is the non-normalized version of Laguerre polynomial, thus it needs to be divided by . It can be shown easily that , thus
The last step consists in evaluating the bound in Theorem 9 with , which gives
Appendix C.2 Optimal polynomials for the Marchenko-Pastur distribution
Proof Let be the -th Chebyshev polynomial of the second kind. By definition, these polynomials are orthogonal w.r.t the semi-circle law , i.e.,
Now, consider the parametrization