Implicit Bias of MSE Gradient Optimization in Underparameterized Neural Networks

01/12/2022
by Benjamin Bowman, et al.

We study the dynamics of a neural network in function space when optimizing the mean squared error via gradient flow. We show that in the underparameterized regime the network learns eigenfunctions of an integral operator T_K^∞ determined by the Neural Tangent Kernel (NTK) at rates corresponding to their eigenvalues. For example, for uniformly distributed data on the sphere S^{d-1} and rotation-invariant weight distributions, the eigenfunctions of T_K^∞ are the spherical harmonics. Our results can be understood as describing a spectral bias in the underparameterized regime. The proofs use the concept of "Damped Deviations", where deviations of the NTK matter less for eigendirections with large eigenvalues due to the occurrence of a damping factor. Beyond the underparameterized regime, the damped-deviations point of view can also be used to track the dynamics of the empirical risk in the overparameterized setting, allowing us to extend certain results in the literature. We conclude that damped deviations offer a simple and unifying perspective on the dynamics when optimizing the squared error.
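As a rough illustration of the rate/eigenvalue correspondence described above, the following is a minimal NumPy sketch of kernel gradient flow with a fixed Gram matrix (a simplifying assumption for illustration, not the paper's underparameterized analysis): the residual decays independently along each eigendirection at a rate set by the corresponding eigenvalue, so large-eigenvalue directions are learned first.

```python
import numpy as np

# Minimal sketch (not from the paper): in the idealized linearized/kernel
# regime with a fixed PSD Gram matrix K, gradient flow on the MSE drives the
# residual r(t) = u(t) - y according to dr/dt = -K r. In the eigenbasis of K,
# each component decays as r_i(t) = r_i(0) * exp(-lambda_i * t), so directions
# with larger eigenvalues are fit faster.

rng = np.random.default_rng(0)
n = 50
A = rng.standard_normal((n, n))
K = A @ A.T / n                       # stand-in symmetric PSD "kernel" matrix
y = rng.standard_normal(n)            # targets; predictions start at u(0) = 0
r0 = -y                               # initial residual u(0) - y

eigvals, eigvecs = np.linalg.eigh(K)  # K = V diag(lambda) V^T

for t in (0.0, 1.0, 10.0, 100.0):
    # component-wise exponential damping in the eigenbasis of K
    r_t = eigvecs @ (np.exp(-eigvals * t) * (eigvecs.T @ r0))
    print(f"t = {t:6.1f}   ||r(t)|| = {np.linalg.norm(r_t):.4f}")
```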


Related research

06/02/2022 · Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
The training of neural networks by gradient descent methods is a corners...

10/08/2021 · Neural Tangent Kernel Eigenvalues Accurately Predict Generalization
Finding a quantitative theory of neural network generalization has long ...

06/02/2019 · The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies
We study the relationship between the speed at which a neural network le...

10/25/2020 · A Dynamical View on Optimization Algorithms of Overparameterized Neural Networks
When equipped with efficient optimization algorithms, the over-parameter...

02/12/2023 · From high-dimensional mean-field dynamics to dimensionless ODEs: A unifying approach to SGD in two-layers networks
This manuscript investigates the one-pass stochastic gradient descent (S...

02/06/2023 · Rethinking Gauss-Newton for learning over-parameterized models
Compared to gradient descent, Gauss-Newton's method (GN) and variants ar...

07/13/2021 · Geometry and Generalization: Eigenvalues as predictors of where a network will fail to generalize
We study the deformation of the input space by a trained autoencoder via...
