1 Introduction
The focus on optimization is a major trend in modern machine learning, where efficiency is mandatory in large-scale problems
[4]. Among other solutions, first-order methods have emerged as the methods of choice. While these techniques may have slow convergence guarantees, they also have low iteration costs, which is ideal in large-scale problems. Consequently, the question of accelerating first-order methods while keeping their small iteration costs has received much attention, see e.g. [32]. Since machine learning solutions are typically derived by minimizing an empirical objective (the training error), most theoretical studies have focused on the error estimates for this latter quantity. However, it has recently become clear that optimization can play a key role from a statistical point of view when the goal is to minimize the expected (test) error. On the one hand, iterative optimization implicitly biases the search for a solution, e.g. by converging to suitable minimal-norm solutions
[26]. On the other hand, the number of iterations parameterizes paths of solutions of different complexity [30]. The idea that optimization can implicitly perform regularization has a long history. In the context of linear inverse problems, it is known as iterative regularization [10]. It is also an old trick for training neural networks, where it is called early stopping [14]. The question of understanding the generalization properties of deep learning has recently sparked much attention on this approach, which has been referred to as implicit regularization, see e.g. [12]. Establishing the regularization properties of iterative optimization requires studying the corresponding expected error by combining optimization and statistical tools. First results in this sense focused on linear least squares with gradient descent and go back to [6, 30]; see also [24] and references therein for improvements. Subsequent works have started considering other loss functions
[15], multilinear models [12] and other optimization methods, e.g. stochastic approaches [25, 17, 13].

In this paper, we consider the implicit regularization properties of acceleration. We focus on linear least squares in Hilbert space, because this setting allows us to derive sharp results, and because working in infinite dimension magnifies the role of regularization: unlike in finite dimension, learning bounds are possible only if some form of regularization is considered. In particular, we consider two of the most popular accelerated gradient approaches, based on Nesterov acceleration [21] and on (a variant of) the heavyball method [23]. Both methods achieve acceleration by exploiting a so-called momentum term, which uses not just the previous iterate but the previous two iterates at each step. Considering a suitable bias-variance decomposition, our results show that accelerated methods behave qualitatively differently from basic gradient descent: while the bias decays faster with the number of iterations, the variance increases faster too. The two effects balance out, showing that accelerated methods achieve the same optimal statistical accuracy as gradient descent, but can indeed do so with fewer computations. Our analysis takes advantage of the linear structure induced by least squares to exploit tools from spectral theory. Indeed, the characterization of both convergence and stability relies on the study of suitable spectral polynomials defined by the iterates. The idea that accelerated methods can be more unstable has been pointed out in
[9] in a pure optimization context. Our results quantify this effect from a statistical point of view. Close to our results is the study in [8], where a stability approach [5] is considered to analyze gradient methods for different loss functions.

2 Learning with (accelerated) gradient methods
Let the input space be a separable Hilbert space $\mathcal{H}$ (with scalar product $\langle\cdot,\cdot\rangle$ and induced norm $\|\cdot\|$) and let the output space be $\mathbb{R}$.¹ (¹As shown in the Appendix, this choice allows us to recover nonparametric kernel learning as a special case.) Let
$\rho$ be an unknown probability measure on the input-output space $\mathcal{H}\times\mathbb{R}$,
$\rho_{\mathcal{H}}$ the induced marginal probability on $\mathcal{H}$, and $\rho(\cdot\,|\,x)$ the conditional probability measure on $\mathbb{R}$ given $x$. We make the following standard assumption: there exists $M>0$ such that
$$|y|\le M \quad \rho\text{-almost surely}. \eqno(1)$$
The goal of least-squares linear regression is to solve the
expected risk minimization problem
$$\min_{w\in\mathcal{H}}\ \mathcal{E}(w),\qquad \mathcal{E}(w)=\int_{\mathcal{H}\times\mathbb{R}}(\langle w,x\rangle-y)^2\,d\rho(x,y), \eqno(2)$$
where $\rho$ is known only through the i.i.d. samples $(x_1,y_1),\dots,(x_n,y_n)$. In the following, we measure the quality of an approximate solution $\hat w$ with the excess risk $\mathcal{E}(\hat w)-\min_{w\in\mathcal{H}}\mathcal{E}(w)$.
The search for a solution is often based on replacing (2) with the empirical risk minimization (ERM)
$$\min_{w\in\mathcal{H}}\ \hat{\mathcal{E}}(w),\qquad \hat{\mathcal{E}}(w)=\frac1n\sum_{i=1}^n(\langle w,x_i\rangle-y_i)^2. \eqno(3)$$
For least squares, an ERM solution can be computed in closed form using a direct solver. However, for large problems, iterative solvers are preferable, and we next describe the approaches we consider.
First, it is useful to rewrite the ERM problem in vector notation. Let
$\hat X:\mathcal{H}\to\mathbb{R}^n$ with $(\hat Xw)_i=\langle w,x_i\rangle$ and $\hat y\in\mathbb{R}^n$ s.t. $\hat y_i=y_i$ for $i=1,\dots,n$. Here the norm on $\mathbb{R}^n$ is the Euclidean norm multiplied by $1/n$. Let $\hat X^*$ be the adjoint of $\hat X$ defined by this choice of norms. Then, ERM becomes
$$\min_{w\in\mathcal{H}}\ \|\hat Xw-\hat y\|^2. \eqno(4)$$
2.1 Gradient descent and accelerated methods
Gradient descent serves as the reference approach throughout the paper. For problem (4) it becomes
$$\hat w_{t+1}=\hat w_t-\gamma\,\hat X^*(\hat X\hat w_t-\hat y), \eqno(5)$$
with initial point $\hat w_0=0$ and stepsize $\gamma$.² The progress made by gradient descent at each iteration can be slow, and the idea behind acceleration is to use the information from previous directions in order to improve the convergence rate of the algorithm.

²The stepsize $\gamma_t$ at the $t$-th iteration must satisfy $\gamma_t\le 1/\|\hat X^*\hat X\|$, where $\|\cdot\|$ denotes the operator norm. If the inputs are bounded by $\kappa$ (which means $\|\hat X^*\hat X\|\le\kappa^2$), it is sufficient to assume $\gamma_t\le 1/\kappa^2$.
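As a concrete illustration, iteration (5) can be sketched in NumPy for a finite-dimensional least-squares problem. The data, dimensions, stepsize and iteration count below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

# Gradient descent for the least-squares objective (1/n)||X w - y||^2:
#   w_{t+1} = w_t - gamma * X^T (X w_t - y) / n,
# started at w_0 = 0, with gamma <= 1 / ||X^T X / n|| (stepsize condition).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)     # noisy linear model

gamma = 1.0 / np.linalg.norm(X.T @ X / n, 2)      # satisfies the condition

w = np.zeros(d)
for t in range(500):
    w = w - gamma * (X.T @ (X @ w - y)) / n

residual = np.linalg.norm(X @ w - y) / np.sqrt(n) # roughly the noise level
```

After enough iterations the residual settles at the noise level, and the iterate approaches the ERM solution.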
Heavyball
Heavyball is a popular accelerated method that adds a momentum term at each iteration:
$$\hat w_{t+1}=\hat w_t-\gamma\,\hat X^*(\hat X\hat w_t-\hat y)+\beta(\hat w_t-\hat w_{t-1}), \eqno(6)$$
with momentum parameter $\beta\ge0$; the case $\beta=0$ reduces to gradient descent. In the quadratic case we consider, it is also called the Chebyshev iterative method. The optimization properties of heavyball have been studied extensively [23, 31]. Here, we consider the following variant. Let $\nu>0$, and consider the varying-parameter heavyball obtained by replacing the constants in (6) with iteration-dependent sequences defined as:
for every $t\ge1$, with the corresponding initialization. With this choice, and considering the least-squares problem, this algorithm is known as the $\nu$-method in the inverse problems literature (see e.g. [10]). This seemingly complex choice of parameters allows us to relate the approach to a suitable orthogonal polynomial recursion, as we discuss later.
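For illustration only, here is a heavyball sketch with a fixed momentum parameter; the varying-parameter $\nu$-method replaces the fixed value below with the iteration-dependent sequence from [10], which we do not reproduce here.

```python
import numpy as np

# Heavy-ball for least squares: the update adds a momentum term
# beta * (w_t - w_{t-1}) on top of the gradient step; beta = 0 recovers
# plain gradient descent. A fixed beta is used here for simplicity.
rng = np.random.default_rng(1)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                                  # noiseless, for simplicity

gamma = 1.0 / np.linalg.norm(X.T @ X / n, 2)
beta = 0.5                                      # illustrative momentum value

w_prev = np.zeros(d)
w = np.zeros(d)
for t in range(300):
    grad = X.T @ (X @ w - y) / n
    w, w_prev = w - gamma * grad + beta * (w - w_prev), w
```

On this well-conditioned noiseless problem the iterates converge to the target at a geometric rate.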
Nesterov acceleration
The second form of gradient acceleration we consider is the popular Nesterov acceleration [21]. In our setting, it corresponds to the iteration
$$\hat v_t=\hat w_t+\beta_t(\hat w_t-\hat w_{t-1}),\qquad \hat w_{t+1}=\hat v_t-\gamma\,\hat X^*(\hat X\hat v_t-\hat y), \eqno(7)$$
with the two initial points $\hat w_0=\hat w_1=0$, and the sequence $(\beta_t)$ chosen as
(8) 
Unlike heavyball, Nesterov acceleration uses the momentum term also in the evaluation of the gradient. Also in this case, optimization results are well known [1, 28].
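A sketch of (7) with the common momentum choice $\beta_t=(t-1)/(t+2)$, used here as a stand-in for the sequence (8) (an assumption on our part, not necessarily the paper's exact choice):

```python
import numpy as np

# Nesterov acceleration for least squares: the gradient is evaluated at the
# extrapolated point v_t = w_t + beta_t * (w_t - w_{t-1}).
rng = np.random.default_rng(2)
n, d = 200, 10
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                                  # noiseless, for simplicity

gamma = 1.0 / np.linalg.norm(X.T @ X / n, 2)
w, w_prev = np.zeros(d), np.zeros(d)
for t in range(1, 301):
    v = w + (t - 1) / (t + 2) * (w - w_prev)    # momentum step
    grad = X.T @ (X @ v - y) / n                # gradient at extrapolated point
    w, w_prev = v - gamma * grad, w
```

Note that, in contrast to the heavyball sketch above, the momentum enters before the gradient evaluation.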
3 Spectral filtering for accelerated methods
Least squares allows us to consider spectral approaches to study the properties of gradient methods for learning. We illustrate these ideas for gradient descent before considering accelerated methods.
Gradient descent as spectral filtering
Note that, by a simple (and classical) induction argument, gradient descent can be written as
$$\hat w_t=\gamma\sum_{k=0}^{t-1}(I-\gamma\hat X^*\hat X)^k\,\hat X^*\hat y.$$
Equivalently, using spectral calculus,
$$\hat w_t=g_t(\hat X^*\hat X)\,\hat X^*\hat y,$$
where $g_t$ are the polynomials $g_t(\lambda)=\gamma\sum_{k=0}^{t-1}(1-\gamma\lambda)^k$, for all $t\in\mathbb{N}$. Note that the polynomials $g_t$ are bounded by $\gamma t$. A first observation is that $\hat w_t$ converges to a minimal-norm ERM solution as $t\to\infty$, since $g_t(\lambda)$ converges to $1/\lambda$. A second observation is that the residual polynomials $r_t(\lambda)=1-\lambda g_t(\lambda)=(1-\gamma\lambda)^t$, which are all bounded by $1$, control ERM convergence.
In particular, if $\hat y$ satisfies a source condition (i.e. it lies in the range of a suitable power of the operator), improved convergence rates can be derived by an easy calculation.
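The closed forms behind this discussion are easy to check numerically: for stepsize $\gamma$, the gradient-descent (Landweber) filter is $g_t(\lambda)=\gamma\sum_{k<t}(1-\gamma\lambda)^k$ with residual $r_t(\lambda)=1-\lambda g_t(\lambda)=(1-\gamma\lambda)^t$. A quick check on a grid (the values of $\gamma$ and $t$ below are arbitrary):

```python
import numpy as np

gamma = 1.0
t = 20
lam = np.linspace(1e-6, 1.0 / gamma, 400)     # spectrum grid in (0, 1/gamma]

# Landweber (gradient descent) filter and residual polynomials
g_t = gamma * sum((1 - gamma * lam) ** k for k in range(t))
r_t = (1 - gamma * lam) ** t
```

The geometric-sum identity $\lambda g_t(\lambda)=1-r_t(\lambda)$, the bound $g_t\le\gamma t$ and the bound $|r_t|\le1$ all hold on the grid.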
As we show in Section 4, considering the polynomials $g_t$ and the residuals $r_t$ allows us to study not only ERM but also the expected risk minimization problem (2), by relating gradient methods to their infinite-sample limit. Further, we show how similar reasoning holds for accelerated methods. In order to do so, it is useful to first define the characterizing properties of these polynomials.
3.1 Spectral filtering
The following definition abstracts the key properties of the function $g_t$, which is then called a spectral filtering function [2]. Following the classical definition, we replace the iteration number with a generic parameter.
Definition 1.
The family $\{g_t\}$ is called a spectral filtering function if the following conditions hold:
 (i)
There exists a constant such that, for any admissible parameter,
(9)
 (ii)
There exists a constant such that, for any admissible parameter,
(10)
Definition 2.
(Qualification)
The qualification of a spectral filtering function is the maximum parameter $q$ such that, for every $t$,
there exists a constant $c_q$ such that
(11) 
Moreover, we say that a filtering function has qualification $\infty$ if (11) holds for every $q>0$.
Methods with finite qualification might have slow convergence rates in certain regimes: the smaller the qualification, the worse the rates can be.
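As an illustration of infinite qualification, for the gradient-descent residual $r_t(\lambda)=(1-\gamma\lambda)^t$ one can check numerically that $\sup_\lambda \lambda^q|r_t(\lambda)|$ decays like $t^{-q}$ for any fixed $q$. Here $\gamma=1$ and $q=2$ are chosen arbitrarily:

```python
import numpy as np

q = 2.0
lam = np.linspace(0.0, 1.0, 100_001)          # spectrum grid, gamma = 1

# sup over the spectrum of lam^q * |r_t(lam)| for several iteration counts
sup = {t: np.max(lam ** q * (1 - lam) ** t) for t in (10, 100, 1000)}
```

Multiplying each supremum by $t^q$ gives a quantity bounded by $q^q$, and increasing $t$ by a factor of 10 shrinks the supremum by roughly $10^q$.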
The discussion in the previous section shows that gradient descent defines a spectral filtering function. More precisely, the following holds.
Proposition 1.
Assume the stepsize satisfies the condition of footnote 2; then the polynomials related to the gradient descent iterates, defined in (5), are a filtering function. Moreover, the method has qualification $\infty$.
The above result is classical, and we report a proof in the appendix for completeness. Next, we discuss analogous results for accelerated methods, and then compare the different spectral filtering functions.
3.2 Spectral filtering for accelerated methods
For the heavyball iteration (6) the following result holds.
Proposition 2.
Assume the stepsize condition holds and let $\nu>0$; then the polynomials related to the heavyball method (6) are a filtering function. Moreover, there exists a positive constant $c_\nu$ such that the method has qualification $\nu$.
The proof of the above proposition follows by combining several intermediate results from [10]. The key idea is to show that the residual polynomials defined by the heavyball iteration form a sequence of orthogonal polynomials with respect to a weight function
which is a so-called shifted Jacobi weight. Results on orthogonal polynomials can then be used to characterize the corresponding spectral filtering function.
The following proposition considers Nesterov acceleration.
Proposition 3.
Assume the stepsize condition holds; then the polynomials related to the Nesterov iterates (7) are a filtering function. Moreover, a lower bound on the qualification of this method holds, with corresponding constants.
3.3 Comparing the different filter functions
We summarize the properties of the spectral filtering functions of the various methods.
The main observation is that the properties of the spectral filtering functions corresponding to the different iterations depend on $t$ for gradient descent, but on $t^2$ for the accelerated methods. As we see in the next section, this leads to substantially different learning properties. Further, we can see that gradient descent is the only algorithm with qualification $\infty$, even if the corresponding constant can be very large. The accelerated methods seem to have smaller qualification. In particular, the heavyball method can attain a high qualification, depending on $\nu$, but the constant is unknown and could be large. For the Nesterov accelerated method, it is an open question whether the lower bound on the qualification is tight or whether a higher qualification can be attained.
In the next section, we show how the properties of the spectral filtering functions can be exploited to study the excess risk of the corresponding iterations.
4 Learning properties for accelerated methods
We first consider a basic scenario, and then a more refined analysis leading to a more general setting and potentially faster learning rates.
4.1 Attainable case
Consider the following basic assumption.
Assumption 1.
Assume there exists $M>0$ such that $|y|\le M$ almost surely, and that there exists $w_*\in\mathcal{H}$ such that $\mathcal{E}(w_*)=\min_{w\in\mathcal{H}}\mathcal{E}(w)$.
Then the following result can be derived.
Theorem 1.
Under Assumption 1, let $\hat w_t$ and $\hat v_t$ be the $t$-th iterates of gradient descent (5) and of an accelerated version given by (6) or (7), respectively. Assuming the sample size to be large enough, there exist two positive constants such that, with high probability,
where the constants do not depend on the number of iterations or the sample size, but do depend on the chosen optimization method.
Moreover, by choosing the stopping rule $t\asymp\sqrt n$ for gradient descent and $t\asymp n^{1/4}$ for the accelerated methods, both algorithms achieve a learning rate of order $n^{-1/2}$.
The proof of the above result is given in the appendix; the novel part is the one concerning accelerated methods, particularly Nesterov acceleration. The result shows how the number of iterations controls the learning properties, both for gradient descent and for accelerated gradient methods. In this sense, implicit regularization occurs in all these approaches. For any $t$, the error is split into two contributions. Inspecting the proof, it is easy to see that the first term in the bound comes from the convergence properties of the algorithm with infinite data; hence the optimization error translates into a bias term. The decay for accelerated methods is much faster than for gradient descent. The second term arises from comparing the empirical iterates with their infinite-sample (population) limit. It is a variance term due to the sampling of the data, and hence it decreases with the sample size. For all methods, this term increases with the number of iterations, indicating that the empirical and population iterations become increasingly different. However, the behavior is markedly worse for accelerated methods. The benefit of acceleration thus seems to be balanced out by this more unstable behavior. In fact, the benefit of acceleration becomes apparent when balancing the two error terms to obtain a final bound. The resulting bound is the same for gradient descent and accelerated methods, and is indeed optimal, since it matches corresponding lower bounds [3, 7]. However, the number of iterations needed by accelerated methods is the square root of the number needed by gradient descent, indicating that a substantial computational gain can be attained. Next, we show how these results can be generalized to a more general setting, considering both weaker and stronger assumptions, corresponding to harder or easier learning problems.
4.2 More refined result
Theorem 1 is a simplified version of the more general result that we discuss in this section. We are interested in also covering the non-attainable case, that is, the case in which no element of the hypothesis space attains the minimum of the expected risk. In order to cover this case, we have to introduce several more definitions and notations; in Appendix 8.2 we give a more detailed description of the general setting. Consider the space of square-integrable functions with its norm, and extend the expected risk to this space. Recall that the minimizer of the expected risk over this larger space is the regression function. The projection of the regression function over the closure of the hypothesis space is defined as
Let $L$ be the integral operator
The first assumption we consider concerns the moments of the output variable, and is more general than the boundedness assumed before.

Assumption 2.
There exist positive constants such that, for every integer $l\ge2$,
This assumption is standard, and it is satisfied in classification or in regression with well-behaved noise. Under this assumption, the regression function is bounded almost surely:
(12) 
The next assumptions are related to the regularity of the target function.
Assumption 3.
There exists a positive constant such that the target function satisfies
This assumption is needed to deal with possible misspecification of the model. The last assumptions quantify the regularity of the target and the size (capacity) of the space.
Assumption 4.
There exist positive constants such that
Moreover, we assume that there exist an exponent and a positive constant such that the effective dimension satisfies
The assumption on the effective dimension is always satisfied with exponent one, and it holds with a smaller exponent when the eigenvalues of the operator decay polynomially. We recall that the space of targets satisfying the source condition can be characterized in terms of the operator $L$; hence, the non-attainable case corresponds to considering low regularity of the target.
Theorem 2.
Under Assumptions 2, 3 and 4, let $\hat w_t$ and $\hat v_t$ be the $t$-th iterates of gradient descent (5) and of an accelerated version given by (6) or (7), respectively. Assume the sample size to be large enough, and the regularity parameter to be smaller than the qualification of the considered algorithm; then there exist two positive constants such that, with high probability,
where the constants do not depend on the number of iterations or the sample size, but do depend on the chosen optimization method.
Choosing suitable stopping rules, both gradient descent and the accelerated methods achieve a learning rate of the same order.
The proof of the above result is given in the appendix. The general structure of the bound is the same as in the basic setting, which is now recovered as a special case. However, in this more general form, the various terms in the bound depend on the regularity assumptions on the problem. In particular, the variance depends on the behavior of the effective dimension, e.g. on the eigenvalue decay, while the bias depends on the regularity assumption on the target. The general comparison between gradient descent and accelerated methods follows the same lines as in the previous section: the faster bias decay of accelerated methods is contrasted by a more unstable behavior. As before, the benefit of accelerated methods becomes clear when deriving the optimal stopping time and the corresponding learning bound: they achieve the accuracy of gradient descent in considerably less time. While heavyball and Nesterov acceleration again have similar behaviors, a subtle difference here resides in their different qualifications, which in principle lead to different behaviors on easy problems, that is, for high regularity. In this regime, gradient descent could work better, since it has infinite qualification. For problems with low regularity, the rates are worse than in the basic setting; hence these problems are hard.
5 Numerical simulation
In this section we show some numerical simulations that validate our results. We want to simulate the case in which the eigenvalues of the operator decay polynomially, as well as the non-attainable case. To do so, we observe that if we consider the kernel setting over a finite input space of size $p$ with the uniform probability distribution, then the function space becomes $\mathbb{R}^p$ with the usual scalar product multiplied by $1/p$, and the operator becomes a $p\times p$ matrix whose entries are given by the kernel evaluated at pairs of points; the kernel is thus fixed by the choice of this matrix. We build the matrix as $K=U\Sigma U^{\top}$, with $U$ an orthogonal matrix and $\Sigma$ a diagonal matrix with polynomially decaying entries. The source condition is then imposed through an appropriate choice of the target coefficients. We simulate the observed outputs by adding zero-mean Gaussian noise of fixed variance to the target values. The sampling operation can be seen as extracting $n$ indices, and building the corresponding kernel matrix and noisy labels. The Representer Theorem ensures that we can build our estimator as a kernel expansion whose coefficient vector depends on the chosen optimization algorithm, and the excess risk of the estimator can then be computed in closed form. For every algorithm considered, we run 50 repetitions, in each of which we sample the data and compute the error of the estimator at the $t$-th iteration of each algorithm; in the end, we compute the mean and the variance of those errors.
In Figure 1 we report the simulated error of all the algorithms considered, for both the attainable and the non-attainable case. We observe that both heavyball and Nesterov acceleration provide faster convergence than gradient descent, but the learning accuracy is not improved. Moreover, we observe that the two accelerated methods show similar behavior.
In Figure 2 we show the test error on the real dataset pumadyn8nh (available at https://www.dcc.fc.up.pt/~ltorgo/Regression/puma.html). Also in this case, we observe the behavior predicted by our theoretical results.
6 Conclusion
In this paper, we have considered the implicit regularization properties of accelerated gradient methods for least squares in Hilbert space. Using spectral calculus, we have characterized the properties of the different iterations in terms of suitable polynomials, and used the latter to derive error bounds in terms of suitable bias and variance terms. The main conclusion is that, under the considered assumptions, accelerated methods have smaller bias but also larger variance. As a byproduct, they achieve the same accuracy as vanilla gradient descent, but with far fewer iterations. Our study opens a number of potential theoretical and empirical research directions. From a theoretical point of view, it would be interesting to consider other learning regimes, for example classification problems, or regularity assumptions beyond the classical nonparametric ones, e.g. misspecified models and fast eigenvalue decays (Gaussian kernel). From an empirical point of view, it would be interesting to carry out a more thorough investigation on a larger number of simulated and real datasets of varying dimension.
References
 [1] Hedy Attouch, Zaki Chbani, and Hassan Riahi. Rate of convergence of the Nesterov accelerated gradient method in the subcritical case. ESAIM: Control, Optimisation and Calculus of Variations, 25:2, 2019.
 [2] Luca Baldassarre, Lorenzo Rosasco, Annalisa Barla, and Alessandro Verri. Multioutput learning via spectral filtering. Machine learning, 87(3):259–301, 2012.
 [3] Gilles Blanchard and Nicole Mücke. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18(4):971–1013, Aug 2018.
 [4] Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in neural information processing systems, pages 161–168, 2008.
 [5] Olivier Bousquet and André Elisseeff. Stability and generalization. Journal of machine learning research, 2(Mar):499–526, 2002.
 [6] Peter Bühlmann and Bin Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98(462):324–339, 2003.
 [7] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized leastsquares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
 [8] Yuansi Chen, Chi Jin, and Bin Yu. Stability and convergence tradeoff of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
 [9] Olivier Devolder, François Glineur, and Yurii Nesterov. First-order methods of smooth convex optimization with inexact oracle. Mathematical Programming, 146(1-2):37–75, 2014.
 [10] Heinz Werner Engl, Martin Hanke, and Andreas Neubauer. Regularization of inverse problems, volume 375. Springer Science & Business Media, 1996.
 [11] Junichi Fujii, Masatoshi Fujii, Takayuki Furuta, and Ritsuo Nakamoto. Norm inequalities equivalent to Heinz inequality. Proceedings of the American Mathematical Society, 118(3):827–830, 1993.
 [12] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461–9471, 2018.
 [13] Moritz Hardt, Benjamin Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240, 2015.
 [14] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.
 [15] Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes sgm. In International Conference on Machine Learning, pages 2340–2348, 2016.
 [16] Junhong Lin and Volkan Cevher. Optimal convergence for distributed learning with stochastic gradient methods and spectralregularization algorithms. stat, 1050:22, 2018.
 [17] Junhong Lin and Lorenzo Rosasco. Optimal learning for multipass stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 4556–4564, 2016.
 [18] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 2018.
 [19] Peter Mathé and Sergei Pereverzev. Regularization of some linear ill-posed problems with discretized random noisy data. Mathematics of Computation, 75(256):1913–1929, 2006.
 [20] Peter Mathé and Sergei V. Pereverzev. Moduli of continuity for operator valued functions. Numerical Functional Analysis and Optimization, 23(5-6):623–631, 2002.
 [21] Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k²). In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
 [22] Andreas Neubauer. On Nesterov acceleration for Landweber iteration of linear ill-posed problems. Journal of Inverse and Ill-posed Problems, 25(3):381–390, 2017.
 [23] Boris T Polyak. Introduction to optimization. Technical report, 1987.
 [24] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Early stopping and nonparametric regression: an optimal datadependent stopping rule. The Journal of Machine Learning Research, 15(1):335–366, 2014.
 [25] Lorenzo Rosasco and Silvia Villa. Learning with incremental iterative regularization. In Advances in Neural Information Processing Systems, pages 1630–1638, 2015.
 [26] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
 [27] Ingo Steinwart and Andreas Christmann. Support vector machines. Springer Science & Business Media, 2008.
 [28] Weijie Su, Stephen Boyd, and Emmanuel Candes. A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems, pages 2510–2518, 2014.
 [29] Gábor Szegő. Orthogonal Polynomials, volume 23. American Mathematical Soc., 1939.
 [30] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
 [31] SK Zavriev and FV Kostyuk. Heavyball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.
 [32] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
7 Appendix: regularization properties for accelerated algorithms
7.1 Regularization properties for gradient descent
Proof of Proposition 1
Proof.
Since the stepsize is chosen to satisfy the condition of footnote 2, we have $0\le 1-\gamma\lambda\le 1$ for every $\lambda$ in the relevant spectral range, and so, by the definitions of $g_t$ and $r_t$, it holds that
hence the Landweber polynomials verify (9) and (10) with the stated constants.
As for the qualification of this method, for every $q>0$ the maximum of the function $\lambda\mapsto\lambda^q(1-\gamma\lambda)^t$ is attained at $\lambda=q/(\gamma(q+t))$, so we get
hence (11) holds for every $q>0$ with a constant depending on $q$, which completes the proof.
∎
7.2 Regularization properties for heavyball
For the sake of simplicity, assume the stepsize is normalized so that $\gamma=1$. Before proceeding with the analysis of the method, we state a lemma which will be useful in the following.
Lemma 1.
Let $\{g_t\}$ be a family of polynomials of degree at most $t-1$, and let $r_t(\lambda)=1-\lambda g_t(\lambda)$ denote the associated residuals.
Assume the residuals satisfy
(13) 
then it holds that
Proof.
Using the definition of the residual and the Mean Value Theorem, there exists a point $\xi$ between $0$ and $\lambda$ such that
where $r_t'$ denotes the first derivative of $r_t$.
Markov’s inequality for polynomials implies that
hence it holds
∎
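The polynomial inequality used above is the Markov brothers' inequality: a polynomial $p$ of degree $t$ satisfies $\sup_{[-1,1]}|p'|\le t^2\sup_{[-1,1]}|p|$, with equality attained by the Chebyshev polynomial $T_t$ at the endpoints. A quick numerical check (the degree below is arbitrary):

```python
import numpy as np
from numpy.polynomial.chebyshev import Chebyshev

t = 7
T = Chebyshev.basis(t)                # Chebyshev polynomial of degree t
dT = T.deriv()                        # its first derivative

x = np.linspace(-1.0, 1.0, 20001)    # grid including the endpoints
sup_p = np.max(np.abs(T(x)))         # equals 1 on [-1, 1]
sup_dp = np.max(np.abs(dT(x)))       # equals t^2, attained at x = +/- 1
```

On the interval relevant for the lemma, the same inequality holds after an affine rescaling, which is where the degree-squared factor in the statement comes from.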
For fixed $\nu$, the residual polynomials associated to the $\nu$-method form a sequence of orthogonal polynomials with respect to a weight function
which is a shifted Jacobi weight; hence the residual polynomials are normalized shifted copies of Jacobi polynomials, where the normalization is due to the constraint that each residual equals one at zero.
Thanks to the properties of orthogonal polynomials, they satisfy the Christoffel-Darboux recurrence formula (see e.g. [29])
and a straightforward computation shows that, for our problem, this recursion carries over to the iterates of the associated method
where, for every $t$, the parameters are defined by
with the corresponding initialization.
In particular, the following result from [10] holds.
Theorem 3.
The residual polynomials of the $\nu$-method (for fixed $\nu$) are uniformly bounded for all $t$;
moreover, they satisfy
(14) 
with appropriate constants.
Proof of Proposition 2
7.3 Regularization properties for Nesterov’s acceleration
The Nesterov iterates (7) can be written as
from which it can easily be proved that the polynomials and the residuals satisfy the following recursions
(15)  
(16) 
for every $t\ge1$, with the corresponding initializations.
Moreover, we can rewrite (16) as
With these choices, we can state a first proposition about the properties of the residual polynomials of Nesterov's accelerated method.
Proposition 4.
Proof.
Fix $\lambda$; following [22], we can see the right-hand side of (17) as a convex combination of
and
We can observe that these polynomials satisfy the following recursions
By computing the squares of the polynomials, and rescaling them so that the two mixed terms cancel, we obtain that
We can observe that the parameters satisfy the following
which implies
Hence we get that
where the second inequality follows by induction.
Finally, using the two facts above yields that
This inequality implies that both terms in the sum are smaller than the bound; hence
(20)  
(21) 
By induction, it follows from (20) that (19) holds in the first case,
because it is a convex combination of the two preceding quantities, multiplied by a bounded factor.
Similarly, (21) implies (19) in the second case.
The remaining cases follow by interpolation.
∎
By a scaled version of Lemma 1 it holds that
Proof of Proposition 3
Proof.
The proof follows immediately from the above results. ∎
8 Appendix: generalization bound via spectral/regularized algorithm
8.1 Learning with kernels
The setting in this paper recovers nonparametric regression over an RKHS as a special case. Let the data space be a probability space with distribution $\rho$; the goal is to minimize the risk
A common way to build an estimator is to consider a symmetric kernel $K$ which is positive definite, which means that for every finite set of points the matrix with entries $K(x_i,x_j)$ is positive semidefinite. This kernel defines a unique Hilbert space of functions with an inner product such that $K(x,\cdot)$ belongs to the space for all $x$, and such that the reproducing property $f(x)=\langle f,K(x,\cdot)\rangle$ holds for all $f$ and $x$. By introducing the feature map $\Phi$ defined by $\Phi(x)=K(x,\cdot)$, and considering the pushforward distribution of $\rho$ under the feature map, we come back to our previous setting by a change of variable.
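As a small sanity check of the finite-space construction used in Section 5 (not the paper's code), note that on a finite input space any positive semidefinite kernel matrix $K$ factors as $K=\Phi\Phi^{\top}$, so the rows of $\Phi$ give an explicit finite-dimensional feature map whose inner products reproduce the kernel:

```python
import numpy as np

rng = np.random.default_rng(3)
pts = rng.uniform(-1, 1, size=15)
K = np.exp(-(pts[:, None] - pts[None, :]) ** 2)   # Gaussian kernel matrix

eigval, eigvec = np.linalg.eigh(K)
eigval = np.clip(eigval, 0.0, None)               # guard tiny negative rounding
Phi = eigvec * np.sqrt(eigval)                    # rows = feature vectors
```

The factorization via the eigendecomposition is one convenient choice; any other square root of $K$ (e.g. a Cholesky factor) would serve equally well as a feature map.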