
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects ...

On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks
Neural networks trained via gradient descent with random initialization ...

Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy
We provide a detailed asymptotic study of gradient flow trajectories and...

Ridgeless Interpolation with Shallow ReLU Networks in 1D is Nearest Neighbor Curvature Extrapolation and Provably Generalizes on Lipschitz Functions
We prove a precise geometric description of all one layer ReLU networks ...

Gradient Dynamics of Shallow Univariate ReLU Networks
We present a theoretical and empirical study of the gradient dynamics of...

Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, Gradient Flow Dynamics
Understanding the learning dynamics and inductive bias of neural network...

How implicit regularization of Neural Networks affects the learned function – Part I
Today, various forms of neural networks are trained to perform approxima...
Implicit bias of gradient descent for mean squared error regression with wide neural networks
We investigate gradient descent training of wide neural networks and the corresponding implicit bias in function space. Focusing on 1D regression, we show that the solution of training a width-n shallow ReLU network is within n^{-1/2} of the function which fits the training data and whose difference from initialization has the smallest 2-norm of the second derivative weighted by a curvature penalty 1/ζ. The curvature penalty function 1/ζ is expressed in terms of the probability distribution used to initialize the network parameters, and we compute it explicitly for several common initialization procedures. For instance, asymmetric initialization with a uniform distribution yields a constant curvature penalty, and hence the solution function is the natural cubic spline interpolation of the training data. The statement generalizes to the training trajectories, which in turn are captured by trajectories of spatially adaptive smoothing splines with decreasing regularization strength.
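The limiting function named in the abstract, the natural cubic spline through the training data, is easy to construct directly. The sketch below is our own illustration, not code from the paper: a pure-Python natural cubic spline interpolator (second derivative zero at both ends), which under the abstract's uniform asymmetric initialization is the function that wide-network gradient descent approaches up to O(n^{-1/2}).

```python
def natural_cubic_spline(xs, ys):
    """Return a callable evaluating the natural cubic spline through (xs, ys).

    Natural boundary conditions: the second derivative vanishes at both
    endpoints. xs must be strictly increasing.
    """
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    # Tridiagonal system for the knot second derivatives M_1..M_{n-1};
    # the natural boundary conditions fix M_0 = M_n = 0.
    a = [0.0] * (n + 1)  # sub-diagonal
    b = [1.0] * (n + 1)  # diagonal
    c = [0.0] * (n + 1)  # super-diagonal
    d = [0.0] * (n + 1)  # right-hand side
    for i in range(1, n):
        a[i] = h[i - 1]
        b[i] = 2.0 * (h[i - 1] + h[i])
        c[i] = h[i]
        d[i] = 6.0 * ((ys[i + 1] - ys[i]) / h[i] - (ys[i] - ys[i - 1]) / h[i - 1])
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n + 1):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    M = [0.0] * (n + 1)
    for i in range(n - 1, 0, -1):
        M[i] = (d[i] - c[i] * M[i + 1]) / b[i]

    def spline(x):
        # Locate the interval [xs[i], xs[i+1]] containing x (clamped to range).
        i = max(0, min(n - 1, next((j for j in range(n) if x <= xs[j + 1]), n - 1)))
        hi = h[i]
        t0, t1 = xs[i + 1] - x, x - xs[i]
        return (M[i] * t0 ** 3 / (6 * hi) + M[i + 1] * t1 ** 3 / (6 * hi)
                + (ys[i] / hi - M[i] * hi / 6) * t0
                + (ys[i + 1] / hi - M[i + 1] * hi / 6) * t1)

    return spline
```

By construction the spline interpolates the data exactly, and on linear data it reduces to linear interpolation, since all second derivatives vanish.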