# Implicit Regularization of Accelerated Methods in Hilbert Spaces

We study learning properties of accelerated gradient descent methods for linear least-squares in Hilbert spaces. We analyze the implicit regularization properties of Nesterov acceleration and a variant of heavy-ball in terms of corresponding learning error bounds. Our results show that acceleration can provides faster bias decay than gradient descent, but also suffers of a more unstable behavior. As a result acceleration cannot be in general expected to improve learning accuracy with respect to gradient descent, but rather to achieve the same accuracy with reduced computations. Our theoretical results are validated by numerical simulations. Our analysis is based on studying suitable polynomials induced by the accelerated dynamics and combining spectral techniques with concentration inequalities.

## Authors

• 2 publications
• 68 publications
• ### Accelerating Stochastic Gradient Descent

There is widespread sentiment that it is not possible to effectively uti...
04/26/2017 ∙ by Prateek Jain, et al. ∙ 0

• ### Accelerated Extra-Gradient Descent: A Novel Accelerated First-Order Method

We provide a novel accelerated first-order method that achieves the asym...
06/14/2017 ∙ by Jelena Diakonikolas, et al. ∙ 0

• ### Anderson acceleration of coordinate descent

Acceleration of first order methods is mainly obtained via inertial tech...
11/19/2020 ∙ by Quentin Bertrand, et al. ∙ 0

• ### Encrypted accelerated least squares regression

Information that is stored in an encrypted format is, by definition, usu...
03/02/2017 ∙ by Pedro M. Esperança, et al. ∙ 0

• ### From Averaging to Acceleration, There is Only a Step-size

04/07/2015 ∙ by Nicolas Flammarion, et al. ∙ 0

• ### Acceleration of Descent-based Optimization Algorithms via Carathéodory's Theorem

We propose a new technique to accelerate algorithms based on Gradient De...
06/02/2020 ∙ by Francesco Cosentino, et al. ∙ 0

• ### Revisiting the Role of Euler Numerical Integration on Acceleration and Stability in Convex Optimization

Viewing optimization methods as numerical integrators for ordinary diffe...
02/23/2021 ∙ by Peiyuan Zhang, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The focus on optimization is a major trend in modern machine learning, where efficiency is mandatory in large scale problems

[4]. Among other solutions, first order methods have emerged as methods of choice. While these techniques are known to have potentially slow convergence guarantees, they also have low iteration costs, ideal in large scale problems. Consequently the question of accelerating first order methods while keeping their small iteration costs have received much attention, see e.g. [32]

. Since machine learning solutions are typically derived minimizing an empirical objective (the training error), most theoretical studies have focused on the error estimated for this latter quantity. However, it has recently become clear that optimization can play a key role from a statistical point of view when the goal is to minimize the expected (test) error. On the one hand, iterative optimization implicitly bias the search for a solution, e.g. converging to suitable minimal norm solutions

[26]. On the other hand, the number of iterations parameterize paths of solutions of different complexity [30].

The idea that optimization can implicitly perform regularization has a long history. In the context of linear inverse problems, it is known as iterative regularization [10]

. It is also an old trick for training neural networks where it is called early stopping

[14]

. The question of understanding the generalization properties of deep learning applications has recently sparked a lot of attention on this approach, which has be referred to as implicit regularization, see e.g.

[12]. Establishing the regularization properties of iterative optimization requires the study of the corresponding expected error by combining optimization and statistical tools. First results in this sense focused on linear least squares with gradient descent and go back to [6, 30], see also [24]

and references there in for improvements. Subsequent works have started considering other loss functions

[15], multi-linear models [12] and other optimization methods, e.g. stochastic approaches [25, 17, 13].

In this paper, we consider the implicit regularization properties of acceleration. We focus on linear least squares in Hilbert space, because this setting allows to derive sharp results and working in infinite dimension magnify the role of regularization. Unlike in finite dimension learning bounds are possible only if some form of regularization is considered. In particular, we consider two of the most popular accelerated gradient approaches, based on Nesterov acceleration [21] and (a variant of) the heavy-ball method [23]

. Both methods achieve acceleration by exploiting a so called momentum term, which uses not only the previous, but the previous two iterations at each step. Considering a suitable bias-variance decomposition, our results show that accelerated methods have a behavior qualitatively different from basic gradient descent. While the bias decays faster with the number of iterations, the variance increases faster too. The two effect balance out, showing that accelerated methods achieve the same optimal statistical accuracy of gradient descent but they can indeed do this with less computations. Our analysis takes advantage of the linear structures induced by least squares to exploit tools from spectral theory. Indeed, the characterization of both convergence and stability rely on the study of suitable spectral polynomials defined by the iterates. While the idea that accelerated methods can be more unstable, this has been pointed out in

[9] in a pure optimization context. Our results quantify this effect from a statistical point of view. Close to our results is the study in [8], where a stability approach is considered to analyze gradient methods for different loss functions [5].

## 2 Learning with (accelerated) gradient methods

Let the input space be a separable Hilbert space (with scalar product and induced norm ) and the output space be 111As shown in Appendix this choice allows to recover nonparametric kernel learning as a special case.. Let

be a unknown probability measure on the input-output space

, the induced marginal probability on , and the conditional probability measure on given . We make the following standard assumption: there exist such that

 ⟨x,x′⟩≤κ2∀x,x′∈X,ρX-almost surely. (1)

The goal of least-squares linear regression is to solve the

expected risk minimization problem

 infw∈XE(w),E(w)=∫X×R(⟨w,x⟩−y)2dρ(x,y), (2)

where is known only through the i.i.d. samples . In the following, we measure the quality of an approximate solution with the excess risk

 E(^w)−infXE.

The search of a solution is often based on replacing (2) with the empirical risk minimization (ERM)

 minw∈X^E(w),^E(w)=1nn∑i=1(⟨w,xi⟩−yi)2. (3)

For least squares an ERM solution can be computed in closed form using a direct solver. However, for large problems, iterative solvers are preferable and we next describe the approaches we consider.

First, it is useful to rewrite the ERM with vectors notation. Let

with and s.t. for . Here the norm is norm in multiplied by . Let be the adjoint of defined by . Then, ERM becomes

 minw∈X^E(w)=∥∥Xw−y∥∥2n. (4)

### 2.1 Gradient descent and accelerated methods

Gradient descent serves as a reference approach throughout the paper. For problem (4) it becomes

 ^wt+1=^wt−αX∗(X^wt−y) (5)

with initial point and the step-size that satisfy 222 The step-size is the step-size at the -th iteration and satisfies the condition where denotes the operatorial norm. Since the operator is bounded by (which means ) it is sufficient to assume . The progress made by gradient descent at each iteration can be slow and the idea behind acceleration is to use the information of the previous directions in order to improves the convergence rate of the algorithm.

#### Heavy-ball

Heavy-ball is a popular accelerated method that adds the momentum at each iteration

 ^wt+1=^wt−αX∗(X^wt−y)+β(^wt−^wt−1) (6)

with , the case reduces to gradient descent. In the quadratic case we consider it is also called Chebyshev iterative method. The optimization properties of heavy-ball have been studied extensively [23, 31]. Here, we consider the following variant. Let , consider the varying parameter heavy-ball replacing in (6) with defined as:

 αt=4κ2(2t+2ν−1)(t+ν−1)(t+2ν−1)(2t+4ν−1)βt=(t−1)(2t−3)(2t+2ν−1)(t+2ν−1)(2t+4ν−1)(2t+2ν−3),

for and with initialization . With this choice and considering the least-squares problem this algorithm is known as method in the inverse problem literature (see e.g. [10]). This seemingly complex parameters’ choice allows to relates the approach to suitable orthogonal polynomials recursion as we discuss later.

#### Nesterov acceleration

The second form of gradient acceleration we consider is the popular Nesterov acceleration [21]. In our setting, it corresponds to the iteration

 ^wt+1=^vt−αX∗(X^vt−y),^vt=^wt+βt(^wt−^wt−1) (7)

with the two initial points , and the sequence chosen as

 βt=t−1t+β,β≥1. (8)

Differently from heavy-ball, Nesterov acceleration uses the momentum term also in the evaluation of the gradient. Also in this case optimization results are well known [1, 28].

Here, as above, optimization results refer to solving ERM (3), (4), whereas in the following we study to which extent the above iterations can used to minimize the expected error (2). In the next section, we discuss a spectral approach which will be instrumental towards this goal.

## 3 Spectral filtering for accelerated methods

Least squares allows to consider spectral approaches to study the properties of gradient methods for learning. We illustrate these ideas for gradient descent before considering accelerated methods.

### Gradient descent as spectral filtering

Note that by a simple (and classical) induction argument, gradient descent can be written as

 ^wt=αt−1∑j=0(I−α^Σ)jX∗y.

Equivalently using spectral calculus

 ^wt=gt(^Σ)X∗y,% with^Σ=X∗X,

where are the polynomials for all and . Note that, the polynomials are bounded by . A first observation is that converges to as , since converges to . A second observation is that the residual polynomials , which are all bounded by , control ERM convergence since,

In particular, if y is in the range of for some (source condition on y) improved convergence rates can be derived noting that by an easy calculation

 |rt(σ)σq|≤(qα)q(1t)q.

As we show in Section 4, considering the polynomials and allows to study not only ERM but also expected risk minimization (2), by relating gradient methods to their infinite sample limit. Further, we show how similar reasoning hold for accelerated methods. In order to do so, it useful to first define the characterizing properties of and .

### 3.1 Spectral filtering

The following definition abstracts the key properties of the function and often called spectral filtering function [2]. Following the classical definition we replace with a generic parameter .

###### Definition 1.

x The family is called spectral filtering function if the following conditions hold:

(i)

There exist a constant such that, for any

 supσ∈(0,κ2]|gλ(σ)|≤Eλ. (9)
(ii)

Let there exist a constant such that, for any

 supσ∈(0,κ2]|rλ(σ)|≤F0. (10)
###### Definition 2.

(Qualification)
The qualification of the spectral filtering function is the maximum parameter such that for any there exist a constant such that

 supσ∈(0,κ2]|rλ(σ)σq|≤Fqλq. (11)

Moreover we say that a filtering function has qualification if (11) holds for every .

Methods with finite qualification might have slow convergence rates in certain regimes. The smallest the qualification the worse the rates can be.

The discussion in the previous section shows that gradient descent defines a spectral filtering function where . More precisely, the following holds.

###### Proposition 1.

Assume for , then the polynomials related to the gradient descent iterates, defined in (5), are a filtering function with parameters and . Moreover it has qualification with parameters .

The above result is classical and we report a proof in the appendix for completeness. Next, we discuss analogous results for accelerate methods and then compare the different spectral filtering functions.

### 3.2 Spectral filtering for accelerated methods

For the heavy-ball (6) the following result holds

###### Proposition 2.

Assume , let and for , then the polynomials related to heavy-ball method (6) are a filtering function with parameters and . Moreover there exist a positive constant such that the -method has qualification .

The proof of the above proposition follows combining several intermediate results from [10]. The key idea is to show that the residual polynomials defined by heavy-ball iteration form a sequence of orthogonal polynomials with respect to the weight function

 ων(σ)=σ2νσ12(1−σ)12,

which is a so called shifted Jacobi weight. Results from orthogonal polynomials can then be used to characterize the corresponding spectral filtering function.
The following proposition considers Nesterov acceleration.

###### Proposition 3.

Assume , then the polynomials related to Nesterov iterates (7) are a filtering function with constants and . Moreover the qualification of this method is at least with constants .

Filtering properties of the Nesterov iteration (7) have been studied recently in the context of inverse problems [22]. In the appendix 7.3 we provide a simplified proof based on studying the properties of suitable discrete dynamical systems defined by the Nesterov iteration (7).

### 3.3 Comparing the different filter functions

We summarize the properties of the spectral filtering function of the various methods for .

Method Qualification Gradien descent 1 1 Heavy-ball 2 1 Nesterov 2 1

The main observation is that the properties of the spectral filtering functions corresponding to the different iterations depend on for gradient descent, but on for the accelerated methods. As we see in the next section this leads to substantially different learning properties. Further we can see that gradient descent is the only algorithm with qualification , even if the parameter can be very large. The accelerated methods seem to have smaller qualification. In particular, the heavy-ball method can attain a high qualification, depending on , but the constant is unknown and could be large. For Nesterov accelerated method, the qualification is at least and it’s an open question whether this bound is tight or higher qualification can be attained.

In the next section, we show how the properties of the spectral filtering functions can be exploited to study the excess risk of the corresponding iterations.

## 4 Learning properties for accelerated methods

We first consider a basic scenario and then a more refined analysis leading to a more general setting and potentially faster learning rates.

### 4.1 Attainable case

Consider the following basic assumption.

###### Assumption 1.

Assume there exist such that -almost surely and such that .

Then the following result can be derived.

###### Theorem 1.

Under Assumption 1, let and be the -th iterations respectively of gradient descent (5) and an accelerated version given by (6) or (7). Assuming the sample-size to be large enough and let then there exist two positive constant and such that with probability at least

 E(^wGDt)−infHE ≤C1(1t+tn)log22δ E(^wacct)−infHE ≤C2(1t2+t2n)log22δ.

where the constants and do not depend on , but depend on the chosen optimization method.
Moreover by choosing the stopping rules and both algorithms have learning rate of order .

The proof of the above results is given in the appendix and the novel part is the one concerning accelerated methods, particularly Nesterov acceleration. The result shows how the number of iteration controls the learning properties both for gradient descent and accelerated gradient. In this sense implicit regularization occurs in all these approaches. For any the error is split in two contributions. Inspecting the proof it is easy to see that, the first term in the bound comes from the convergence properties of the algorithm with infinite data. Hence the optimization error translates into a bias term. The decay for accelerated method is much faster than for gradient descent. The second term arises from comparing the empirical iterates with their infinite sample (population) limit. It is a variance term depending on the sampling in the data and hence decreases with the sample size. For all methods, this term increases with the number of iterations, indicating that the empirical and population iterations are increasingly different. However, the behavior is markedly worse for accelerated methods. The benefit of acceleration seems to be balanced out by this more unstable behavior. In fact, the benefit of acceleration is apparent balancing the error terms to obtain a final bound. The obtained bound is the same for gradient descent and accelerated methods, and is indeed optimal since it matches corresponding lower bounds [3, 7]. However, the number of iterations needed by accelerated methods is the square root of those needed by gradient descent, indicating a substantial computational gain can be attained. Next we show how these results can be generalized to a more general setting, considering both weaker and stronger assumptions, corresponding to harder or easier learning problems.

### 4.2 More refined result

Theorem 1 is a simplified version of the more general result that we discuss in this section. We are interested in covering also the non-attainable case, that is when there is no such that . In order to cover this case we have to introduce several more definitions and notations. In Appendix 8.2 we give a more detailed description of the general setting. Consider the space of the square integrable functions with the norm and extend the expected risk to defining . Let be the hypothesis space of functions such that almost surely. Recall that, the minimizer of the expected risk over is the regression function . The projection over the closure of the hypothesis space is defined as

 fH=argming∈¯¯¯¯H∥∥g−fρ∥∥ρX.

Let be the integral operator

 Lf(x)=∫Xf(x′)⟨x,x′⟩dρX(x′).

The first assumption we consider concern the moments of the output variable and is more general than assuming the output variable

to be bounded as assumed before.

###### Assumption 2.

There exist positive constant and such that for all ,

 ∫R|y|ldρ(y|x)≤12l!Ml−2Q2ρX almost surely.

This assumption is standard and satisfied in classification or regression with well behaved noise. Under this assumption the regression function is bounded almost surely

 |fρ(x)|≤∫R|y|dρ(y|x)≤(∫R|y|2dρ(y|x))1/2≤Q. (12)

The next assumptions are related to the regularity of the target function .

###### Assumption 3.

There exist a positive constant such that the target function satisfy

 ∫X(fH(x)−fρ(x))2x⊗xdρX(x)⪯B2Σ.

This assumption is needed to deal with the misspecification of the model. The last assumptions quantify the regularity of and the size (capacity) of the space .

###### Assumption 4.

There exist and such that

 fH=Lrg0, with ∥g0∥ρX≤R.

Moreover we assume that there exist and a positive constant such that the effective dimension

 N(λ)=Tr(L(L+λI)−1)≤cγλ−1γ.

The assumption on is always true for and

and it’s satisfied when the eigenvalues

of decay as . We recall that, the space can be characterized in terms of the operator , indeed

 H=L1/2(L2ρX).

Hence, the non-attainable corresponds to considering .

###### Theorem 2.

Under Assumption 2, 3, 4, let and be the -th iterations of gradient descent (5) and an accelerated version given by (6) or (7) respectively. Assuming the sample-size to be large enough, let and assuming to be smaller than the qualification of the considered algorithm, then there exist two positive constant and such that with probability at least

 E(^wGDt)−infHE ≤C1⎛⎝1t2r+t1γn⎞⎠log22δ E(^wacct)−infHE ≤C2⎛⎝1t4r+t2γn⎞⎠log22δ.

where the constants and do not depend on , but depend on the chosen optimization.
Choosing the stopping rules and both gradient descent and accelerated methods achieve a learning rate of order .

The proof of the above result is given in the appendix. The general structure of the bound is the same as in the basic setting, which is now recovered as a special case. However, in this more general form, the various terms in the bound depend now on the regularity assumptions on the problem. In particular, the variance depends on the effective dimension behavior, e.g. on the eigenvalue decay, while the bias depend on the regularity assumption on . The general comparison between gradient descent and accelerated methods follows the same line as in the previous section. Faster bias decay of accelerated methods is contrasted by a more unstable behavior. As before, the benefit of accelerated methods becomes clear when deriving optimal stopping time and corresponding learning bound: they achieve the accuracy of gradient methods but in considerable less time. While heavy-ball and Nesterov have again similar behaviors, here a subtle difference resides in their different qualifications, which in principle lead to different behavior for easy problems, that is for large and . In this regime, gradient descent could work better since it has infinite qualification. For problems in which and the rates are worse than in the basic setting, hence these problems are hard.

## 5 Numerical simulation

In this section we show some numerical simulations to validate our results. We want to simulate the case in which the eigenvalues of the operator are for some and the non-attainable case . In order to do this we observe that if we consider the kernel setting over a finite space of size

with the uniform probability distribution

, then the space becomes with the usual scalar product multiplied by . the operator becomes a matrix which entries are for every , where is the kernel, which is fixed by the choice of the matrix . We build the matrix with orthogonal matrix and diagonal matrix with entries . The source condition becomes for some . We simulate the observed output as where

is the zero-mean normal distribution of variance

. The sampling operation can be seen as extracting indices and building the kernel matrix and the noisy labels for every . The Representer Theorem ensure that we can built our estimator as where the vector depends on the chosen optimization algorithm and takes the form . The excess risk of the estimator is given by .
For every algorithm considered, we run 50 repetitions, in which we sample the data-space and compute the error , where represents the estimator related to the -th iteration of one of the considered algorithms, and in the end we compute the mean and the variance of those errors.

In Figure 1 we simulate the error of all the algorithms considered for both attainable and non-attainable case. We observe that both heavy-ball and Nesterov acceleration provides faster convergence rates with respect to gradient descent method, but the learning accuracy is not improved. Moreover we observe that the accelerated methods considered show similar behavior.

In Figure 2 we show the test error related to the real dataset pumadyn8nh (available at https://www.dcc.fc.up.pt/ ltorgo/Regression/puma.html). Even in this case we can observe the behaviors shown in our theoretical results.

## 6 Conclusion

In this paper, we have considered the implicit regularization properties of accelerated gradient methods for least squares in Hilbert space. Using spectral calculus we have characterized the properties of the different iterations in terms of suitable polynomials. Using the latter, we have derived error bounds in terms of suitable bias and variance terms. The main conclusion is that under the considered assumptions accelerated methods have smaller bias but also larger variance. As a byproduct they achieve the same accuracy of vanilla gradient descent but with much fewer iterations. Our study opens a number of potential theoretical and empirical research directions. From a theory point of view, it would be interesting to consider other learning regimes. For examples classification problems or other regularity assumptions beyond classical nonparametric assumptions, e..g. misspecified models and fast eigenvalues decays (Gaussian kernel). From an empirical point of view it would be interesting to do a more thorough investigation on a larger number of simulated and real data-sets of varying dimension.

## References

• [1] Hedy Attouch, Zaki Chbani, and Hassan Riahi. Rate of convergence of the nesterov accelerated gradient method in the subcritical case . ESAIM: Control, Optimisation and Calculus of Variations, 25:2, 2019.
• [2] Luca Baldassarre, Lorenzo Rosasco, Annalisa Barla, and Alessandro Verri. Multi-output learning via spectral filtering. Machine learning, 87(3):259–301, 2012.
• [3] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, Aug 2018.
• [4] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.
• [5] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
• [6] Peter Bühlmann and Bin Yu. Boosting with the l 2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
• [7] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
• [8] Yuansi Chen, Chi Jin, and Bin Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
• [9] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
• [10] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.
• [11] Junichi Fujii, Masatoshi Fujii, Takayuki Furuta, and Ritsuo Nakamoto. Norm inequalities equivalent to heinz inequality. Proceedings of the American Mathematical Society, 118(3):827–830, 1993.
• [12] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.
• [13] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
• [14] Yann A LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In Neural networks: Tricks of the trade, pages 9–48. Springer, 2012.
• [15] Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes sgm. In International Conference on Machine Learning, pages 2340–2348, 2016.
• [16] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectral-regularization algorithms. stat, 1050:22, 2018.
• [17] Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 4556–4564, 2016.
• [18] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over hilbert spaces. Applied and Computational Harmonic Analysis, 2018.
• [19] Peter Mathé and Sergei Pereverzev. Regularization of some linear ill-posed problems with discretized random noisy data. Mathematics of Computation, 75(256):1913–1929, 2006.
• [20] Peter Mathé and Sergei V Pereverzev. Moduli of continuity for operator valued functions. Numerical Functional Analysis and Optimization, 23(5-6):623–631, 2002.
• [21] Yurii E Nesterov. A method for solving the convex programming problem with convergence rate o (1/k^ 2). In Dokl. akad. nauk Sssr, volume 269, pages 543–547, 1983.
• [22] Andreas Neubauer. On nesterov acceleration for landweber iteration of linear ill-posed problems. Journal of Inverse and Ill-posed Problems, 25(3):381–390, 2017.
• [23] Boris T Polyak. Introduction to optimization. Technical report, 1987.
• [24] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and non-parametric regression: an optimal data-dependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.
• [25] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
• [26] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
• [27] Ingo Steinwart and Andreas Christmann. Springer Science & Business Media, 2008.
• [28] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
• [29] Gabor Szeg. Orthogonal polynomials, volume 23. American Mathematical Soc., 1939.
• [30] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
• [31] SK Zavriev and FV Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.
• [32] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

## 7 Appendix: regularization properties for accelerated algorithms

### 7.1 Regularization properties for gradient descent

Proof of Proposition 1

###### Proof.

Since and is chosen such that it holds that for every and so for the definitions of and it holds

 supσ∈(0,κ2]|gt(σ)|≤αt supσ∈(0,κ2]|rt(σ)|≤1

hence Landweber polynomials verify (9) and (10) with and .
For what concern the qualification of this method, for every the maximum of the function is attained at , so we get

hence we prove (11) for every with and complete the proof.

### 7.2 Regularization properties for heavy-ball

For the sake of simplicity assume . Before proceeding with the analysis of the -method we state one lemma, which will be useful in the following.

###### Lemma 1.

Let be a family of polynomials of degree with and the associated residuals.
Assume the residuals satisfy

 |rt(σ)|≤1∀σ∈[0,1],t∈N (13)

then it holds that

 |gt(σ)|≤2t2∀σ∈[0,1].
###### Proof.

Using the definition of the residual and the Mean Value Theorem there exist such that

 gt(σ)=1−rt(σ)σ=rt(0)−rt(σ)σ=−r′t(¯σ).

where denotes the first derivative of .
Markov’s inequality for polynomials implies that

 supσ∈[0,1]|r′t(σ)|≤2t2supσ∈[0,1]|rt(σ)|,

hence it holds

 gt(σ)≤sup¯σ∈[0,1]|r′t(¯σ)|≤2t2.

Fixed the residual polynomials associated to the -method form a sequence of orthogonal polynomials with respect to the weight function

 ων(σ)=σ2νσ12(1−σ)12,

which is a shifted Jacobi weight, hence the residual polynomials are normalized shifted copies of Jacobi polynomials, where the normalization is due to the constraint .
Thanks to the properties of orthogonal polynomials, they satisfy Christoffel-Darboux recurrence formula (see e.g. [29])

 rt+1=rt(σ)+βt+1(rt(σ)−rt−1(σ))−αt+1σrt(σ),t≥1

and a straightforward computation shows that this recursion on our problem carries over to the iterates of the associated method

 ^wt+1=^wt−αt+1X∗(X^wt−y)+βt+1(^wt−^wt−1).

where, for every , the parameters are defined by

 αt= 4(2t+2ν−1)(t+ν−1)(t+2ν−1)(2t+4ν−1) βt= (t−1)(2t−3)(2t+2ν−1)(t+2ν−1)(2t+4ν−1)(2t+2ν−3),

with initialization .
In particular it holds the following result from [10].

###### Theorem 3.

The residual polynomials of the -method ( fixed) are uniformely bounded for all ,

 |rt(σ)|≤1σ∈[0,1];

they further satisfy

 |σνrt(σ)|≤cνt−2ν∀σ∈[0,1] (14)

with appropriate constants .

Proof of Proposition 2

###### Proof.

Theorem 3 states that (10) holds true with and that the qualification of the method is . Moreover by the Lemma 1 we get that also that (9) holds with .

### 7.3 Regularization properties for Nesterov’s acceleration

Nesterov iterates (7) can be written as

 ^wt+1 =^wt+βt(^wt−^wt−1)−αX∗(X(^wt+βt(^wt−^wt−1))−y)= =[(βt+1)(I−α^Σ)]^wt+[−βt(I−α^Σ)]^wt−1+αX∗y

and since it can be easily proved that the polynomials and the residual satisfy the following recursions

 gt+1(σ) =(1−ασ)[gt(σ)+βt(gt(σ)−gt−1(σ))]+α (15) rt+1(σ) =(1−ασ)[rt(σ)+βt(rt(σ)−rt−1(σ))] (16)

for every with initialization and .
Moreover, we can rewrite (16) as

 rt+1(σ)=(1−ασ)[(1−θt)rt(σ)+θt(rt−1(σ)+1θt−1(rt(σ)−rt−1(σ)))] (17)

where for every is defined such that

 βt=θtθt−1(1−θt−1),

in particular, the choice (8) implies

 θt=βt+β. (18)

With these choices we can state a first proposition about the properties of the residual polynomials of the Nesterov’s accelerated method.

###### Proposition 4.

Let satisfy the recursion (17) where the step-size is chosen such that and defined in (18), the for all

 σr|rt(σ)|≤(β2α)rt−2r (19)

for all .

###### Proof.

Let , following [22] we can see the right hand of (17) as a convex combination between and

 Rt(σ)=rt−1(σ)+1θt−1(rt(σ)−rt−1(σ)).

We can observe that polynomials and satisfy the following recursions

 rt+1(σ) =(1−ασ)(1−θt)rt(σ)+θt(1−ασ)Rt(σ) Rt+1(σ) =−ασθt(1−θt)rt(σ)+(1−ασ)Rt(σ)

By computing the square of the polynomials and rescaling them in order to get the two mixed term to be opposite, we obtain that

 ασθ2tr2t+1(σ)+(1−ασ)R2t+1=(1−ασ)[(1−θt)2ασθ2tr2t(σ)+(1−ασ)R2t(σ)]

We can observe that parameters satisfy the following

 1≥θt≥θt−11+θt−1

which implies

 (1−θ2t)θ2t≤1θ2t−1.

Hence we get that

 ασθ2tr2t+1(σ)+(1−ασ)R2t+1≤ (1−ασ)[ασθ2t−1r2t(σ)+(1−ασ)R2t(σ)] ≤ (1−ασ)t[ασθ20r21(σ)+(1−ασ)R21(σ)]

where the second inequality follows by induction.
Finally, using that and , yields that

 ασθ2t−1r2t(σ)+(1−ασ)R2t≤(1−ασ)t+1.

This inequality implies that both the terms in the sum are smaller that , hence

 |Rt(σ)|≤1 (20) σr2t(σ)≤θ2t−1α(1−ασ)t+1 (21)

By induction it follows from (20) that (19) holds for :

 |rt(σ)|≤1

because is a convex combination of and multiplied by .
While (21) implies (19) for . The remaining cases

By a scaled version of Lemma 1 it holds that

 |gt(σ)|≤2αt2∀σ∈[0,κ2].

Proof of Proposition 3

###### Proof.

The proof follows immediately by the above results. ∎

## 8 Appendix: generalization bound via spectral/regularized algorithm

### 8.1 Learning with kernels

The setting in this paper recover non-parametric regression over a RKHS as a special case. Let be a probability space with distribution , the goal is to minimize the risk

 E(f)=∫Ξ×R(y−f(ξ))2dμ(ξ,y).

A common way to build an estimator is to consider a symmetric kernel which is positive definite, which means that for every and the matrix with the entries for . This kernel defines a unique Hilbert space of function with the inner product and such that for all , and the following reproducing property holds for all , . By introducing the feature map defined by , and we further consider , where , which provide the probability distribution . Denoting and we come back to our previous setting, in fact by the change of variable we have

 ∫Ξ×R(y−f(ξ))2dμ(ξ,y)=∫Ξ×R(y−⟨f,Kξ⟩K)2dμ(ξ,y)=∫X×R(y−⟨f,x⟩K)2dρ(x,y).

### 8.2 Mathematical setting

Let’s consider the hypothesis space

 H={f:X→R|∃w∈X with f(x)=⟨w,x⟩ρX-% almost surely},

which under assumptio 1 is a subspace of the Hilbert space of the square integral functions from to with respect to the measure

 L2ρX={f:X→R|∥f∥2ρX=⟨f,f⟩ρX:=∫Xf(x)2dρX(x)<+∞}.

The function that minimizes the expected risk over all possible measurable functions is the regression function [27].

 fρ=argminf:X→R