Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?

08/26/2021
by Dominic Richards, et al.

Modern methods for learning from data depend on many tuning parameters, such as the stepsize for optimization methods and the regularization strength for regularized learning methods. Since performance can depend strongly on these parameters, it is important to develop comparisons between classes of methods, not just between particular tuned methods. Here, we aim to compare classes of estimators via the relative performance of the best method in the class. This allows us to rigorously quantify the tuning sensitivity of learning algorithms. As an illustration, we investigate the statistical estimation performance of ridge regression with a uniform grid of regularization parameters, and of gradient descent iterates with a fixed stepsize, in the standard linear model with a random isotropic ground truth parameter. (1) For orthogonal designs, we find the exact minimax optimal classes of estimators, showing they are equal to gradient descent with a polynomially decaying learning rate. We find the exact suboptimalities of ridge regression and gradient descent with a fixed stepsize, showing that they decay as either 1/k or 1/k^2, for specific ranges of the number of estimators k. (2) For general designs with a large number of non-zero eigenvalues, we find that gradient descent outperforms ridge regression when the eigenvalues decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we find that ridge regression outperforms gradient descent. Our results highlight the importance of tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our setting, it can be outperformed by gradient descent when both are restricted to being tuned over a finite regularization grid.
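To make the comparison concrete, below is a minimal numerical sketch of the two estimator classes described in the abstract: ridge regression tuned over a uniform grid of regularization parameters, and the iterates of gradient descent with a fixed stepsize, with each class judged by the estimation error of its best member. The problem sizes, noise level, grid range, and stepsize choice are illustrative assumptions, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 50, 0.5                        # illustrative sizes, not from the paper
X = rng.standard_normal((n, d)) / np.sqrt(n)      # random design matrix
theta_star = rng.standard_normal(d)               # isotropic random ground truth parameter
y = X @ theta_star + sigma * rng.standard_normal(n)

XtX, Xty = X.T @ X, X.T @ y

def ridge(lam):
    # Ridge estimator for a single regularization strength lam.
    return np.linalg.solve(XtX + lam * np.eye(d), Xty)

def gd_iterates(eta, num_steps):
    # Gradient descent on the least-squares objective with a fixed stepsize eta.
    theta, iterates = np.zeros(d), []
    for _ in range(num_steps):
        theta = theta - eta * (XtX @ theta - Xty)
        iterates.append(theta.copy())
    return iterates

def est_error(theta):
    # Squared estimation error against the ground truth.
    return float(np.sum((theta - theta_star) ** 2))

k = 20                                            # number of estimators in each class
lams = np.linspace(1e-3, 5.0, k)                  # uniform grid of regularization parameters
ridge_errors = [est_error(ridge(lam)) for lam in lams]

eta = 1.0 / np.linalg.eigvalsh(XtX).max()         # fixed stepsize below 1/L for stability
gd_errors = [est_error(th) for th in gd_iterates(eta, k)]

# Compare the classes by the performance of the best member of each.
print("best ridge error on the grid:", min(ridge_errors))
print("best gradient descent iterate:", min(gd_errors))

The sketch only illustrates how the comparison criterion is set up; which class wins in theory depends, as the abstract states, on how the eigenvalues of the design decay.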

Related research

01/14/2022: The Implicit Regularization of Momentum Gradient Descent with Early Stopping
  The study on the implicit regularization induced by gradient-based optim...

02/07/2019: Combining learning rate decay and weight decay with complexity gradient descent - Part I
  The role of L^2 regularization, in the specific case of deep neural netw...

12/06/2020: Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
  Establishing a theoretical analysis that explains why deep learning can ...

02/04/2022: Elastic Gradient Descent and Elastic Gradient Flow: LARS Like Algorithms Approximating the Solution Paths of the Elastic Net
  The elastic net combines lasso and ridge regression to fuse the sparsity...

01/21/2022: Tuned Regularized Estimators for Linear Regression via Covariance Fitting
  We consider the problem of finding tuned regularized parameter estimator...

05/29/2019: The cost-free nature of optimally tuning Tikhonov regularizers and other ordered smoothers
  We consider the problem of selecting the best estimator among a family o...

08/10/2021: The Benefits of Implicit Regularization from SGD in Least Squares Problems
  Stochastic gradient descent (SGD) exhibits strong algorithmic regulariza...
