
When Does Preconditioning Help or Hurt Generalization?

by Shun-ichi Amari, et al.

While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization remains controversial. For instance, it has been pointed out that gradient descent (GD), in contrast to second-order optimizers, converges to solutions with small Euclidean norm in many overparameterized models, leading to favorable generalization properties. In this work, we question the common belief that first-order optimizers generalize better. We provide a precise asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioners P, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We characterize the optimal P for the bias and variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters). Specifically, when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can generalize better than GD. Conversely, in the setting with clean labels, a well-specified model, and a well-aligned signal, GD achieves better generalization. Based on this analysis, we consider several approaches to manage the bias-variance tradeoff, and find that interpolating between GD and NGD may generalize better than either algorithm. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. In our empirical comparisons of first- and second-order optimization of neural networks, we observe robust trends matching our theoretical analysis.
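To make the GD/NGD comparison concrete, below is a minimal NumPy sketch (not from the paper) for overparameterized ridgeless linear regression: preconditioned GD with update theta -= eta * P @ grad, started from zero, converges to the minimum P^{-1}-norm interpolating solution, so the limiting population risk of each preconditioner can be evaluated in closed form. The feature covariance, noise level, and the simple interpolation P_alpha = alpha*I + (1-alpha)*Sigma^{-1} are illustrative assumptions; in this linear setting the population Fisher is proportional to the feature covariance Sigma, so Sigma^{-1} stands in for the NGD preconditioner.

```python
# Sketch: closed-form limit of preconditioned GD on overparameterized
# ridgeless least squares, comparing GD (P = I), an NGD-like choice
# (P proportional to the inverse feature covariance, i.e. inverse
# population Fisher), and interpolations between them.
# The data model and interpolation scheme are illustrative assumptions,
# not the paper's exact experimental setup.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                      # overparameterized: d > n
sigma_noise = 0.5                   # label noise std

# Anisotropic feature covariance (the "shape" of the features).
eigvals = np.linspace(5.0, 0.1, d)
Sigma = np.diag(eigvals)

theta_star = rng.normal(size=d) / np.sqrt(d)          # true signal
X = rng.normal(size=(n, d)) @ np.diag(np.sqrt(eigvals))
y = X @ theta_star + sigma_noise * rng.normal(size=n)

def preconditioned_interpolator(P):
    """Limit of GD with update theta -= eta * P @ grad, started at zero:
    the interpolating solution of minimum P^{-1}-weighted norm."""
    return P @ X.T @ np.linalg.solve(X @ P @ X.T, y)

def population_risk(theta):
    """E_x[(x^T theta - x^T theta_star)^2] under the feature distribution."""
    diff = theta - theta_star
    return diff @ Sigma @ diff

P_gd = np.eye(d)                     # gradient descent
P_ngd = np.linalg.inv(Sigma)         # NGD-like: inverse population Fisher

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    P = alpha * P_gd + (1 - alpha) * P_ngd   # interpolate NGD <-> GD
    risk = population_risk(preconditioned_interpolator(P))
    print(f"alpha={alpha:.2f} (1.0 = GD, 0.0 = NGD-like): risk = {risk:.4f}")
```

Varying the noise level sigma_noise or the alignment between theta_star and the top eigendirections of Sigma in this sketch is one way to reproduce the qualitative trade-off described above: noisier or misaligned problems tend to favor the NGD-like preconditioner, while clean, well-aligned problems tend to favor plain GD.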



