On the Benefits of Large Learning Rates for Kernel Methods

02/28/2022
by   Gaspard Beugnot, et al.

This paper studies an intriguing phenomenon related to the good generalization performance of estimators obtained by using large learning rates within gradient descent algorithms. First observed in the deep learning literature, we show that this phenomenon can be precisely characterized in the context of kernel methods, even though the resulting optimization problem is convex. Specifically, we consider the minimization of a quadratic objective in a separable Hilbert space, and show that, with early stopping, the choice of learning rate influences the spectral decomposition of the obtained solution on the Hessian's eigenvectors. This extends an intuition described by Nakkiran (2020) on a two-dimensional toy problem to realistic learning scenarios such as kernel ridge regression. While large learning rates can be proven beneficial as soon as there is a mismatch between the train and test objectives, we further explain why this benefit already arises in classification tasks without assuming any particular mismatch between train and test data distributions.
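The core mechanism can be illustrated on a toy quadratic: along each eigendirection of the Hessian with eigenvalue lam, gradient descent contracts the error by a factor (1 - lr * lam) per step, so with a fixed early-stopping time a larger learning rate fits the low-curvature directions more. A minimal sketch (the eigenvalues, step counts, and step sizes below are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Gradient descent on a quadratic f(w) = 0.5 * w^T H w - b^T w,
# with H diagonal so each coordinate is a Hessian eigendirection.
# Error along direction i shrinks as (1 - lr * eigvals[i])^steps.
eigvals = np.array([1.0, 0.01])   # one sharp, one flat direction
w_star = np.array([1.0, 1.0])     # minimizer of the quadratic

def run_gd(lr, steps):
    w = np.zeros(2)
    for _ in range(steps):
        grad = eigvals * (w - w_star)  # gradient of the quadratic
        w -= lr * grad
    return w

# Same early-stopping time, two learning rates; lr = 1.9 is still
# stable because it is below 2 / max(eigvals).
small = run_gd(lr=0.1, steps=50)
large = run_gd(lr=1.9, steps=50)

print("small lr residual per direction:", np.abs(small - w_star))
print("large lr residual per direction:", np.abs(large - w_star))
```

At the stopping time, both runs have essentially fit the sharp direction, but the large-learning-rate run has made far more progress on the flat one, changing the spectral decomposition of the early-stopped solution.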


Related research

05/15/2020 · Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems
Learning rate schedule can significantly affect generalization performan...

08/29/2023 · Random feature approximation for general spectral methods
Random feature approximation is arguably one of the most popular techniq...

12/06/2020 · Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
Establishing a theoretical analysis that explains why deep learning can ...

06/29/2023 · Solving Kernel Ridge Regression with Gradient-Based Optimization Methods
Kernel ridge regression, KRR, is a non-linear generalization of linear r...

06/03/2018 · Analysis of regularized Nyström subsampling for regression functions of low smoothness
This paper studies a Nyström type subsampling approach to large kernel l...
