
When Does Preconditioning Help or Hurt Generalization?
While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization remains controversial. For instance, it has been pointed out that gradient descent (GD), in contrast to second-order optimizers, converges to solutions with small Euclidean norm in many overparameterized models, leading to favorable generalization properties. In this work, we question the common belief that first-order optimizers generalize better. We provide a precise asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioners P, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We characterize the optimal P for the bias and variance, and find that the relative generalization performance of different optimizers depends on the label noise and the "shape" of the signal (true parameters). Specifically, when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can generalize better than GD. Conversely, in the setting with clean labels, a well-specified model, and a well-aligned signal, GD achieves better generalization. Based on this analysis, we consider several approaches to manage the bias-variance tradeoff, and find that interpolating between GD and NGD may generalize better than either algorithm. We then extend our analysis to regression in the reproducing kernel Hilbert space and demonstrate that preconditioned GD can decrease the population risk faster than GD. In our empirical comparisons of first- and second-order optimization of neural networks, we observe robust trends matching our theoretical analysis.
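The setting the abstract describes can be sketched numerically. For ridgeless overparameterized linear regression, preconditioned GD started from zero converges to the interpolator with minimum P⁻¹-weighted norm, which has the closed form θ = P Xᵀ(X P Xᵀ)⁻¹y. The minimal NumPy sketch below (an illustration, not the paper's code; the covariance, signal shapes, and noise level are assumptions chosen for the example) compares the population risk of the GD solution (P = I) and the NGD solution (P = Σ⁻¹, the inverse Fisher for linear regression with Gaussian features) in two regimes: clean labels with a signal aligned to the top eigendirections, and noisy labels with a misaligned signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200  # overparameterized: more parameters than samples

# Anisotropic diagonal feature covariance Sigma; for linear regression with
# Gaussian features this is also the population Fisher information matrix.
eigs = np.linspace(1.0, 0.01, d)
Sigma = np.diag(eigs)
X = rng.normal(size=(n, d)) * np.sqrt(eigs)  # rows approximately ~ N(0, Sigma)

def interpolator(X, y, P):
    """Limit of preconditioned GD from zero init: theta = P X^T (X P X^T)^{-1} y."""
    return P @ X.T @ np.linalg.solve(X @ P @ X.T, y)

def population_risk(theta, theta_star):
    """E[(x^T (theta - theta_star))^2] = (theta - theta_star)^T Sigma (theta - theta_star)."""
    diff = theta - theta_star
    return diff @ Sigma @ diff

P_gd = np.eye(d)              # plain gradient descent
P_ngd = np.linalg.inv(Sigma)  # natural gradient: inverse Fisher

# Regime 1: clean labels, signal concentrated on the top eigendirections.
theta_aligned = np.zeros(d)
theta_aligned[:10] = 1.0
y_clean = X @ theta_aligned

# Regime 2: noisy labels, signal tilted toward low-eigenvalue directions.
theta_mis = rng.normal(size=d) / np.sqrt(eigs)
theta_mis /= np.linalg.norm(theta_mis)
y_noisy = X @ theta_mis + rng.normal(scale=0.5, size=n)

for name, y, theta_star in [("clean/aligned", y_clean, theta_aligned),
                            ("noisy/misaligned", y_noisy, theta_mis)]:
    r_gd = population_risk(interpolator(X, y, P_gd), theta_star)
    r_ngd = population_risk(interpolator(X, y, P_ngd), theta_star)
    print(f"{name}: GD risk {r_gd:.3f}, NGD risk {r_ngd:.3f}")
```

Both solutions interpolate the training data exactly; they differ only in which interpolator the preconditioner selects, which is what drives the bias-variance comparison in the paper. Which preconditioner wins in any single random draw depends on the noise level and signal shape, matching the qualitative dependence the abstract describes.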