Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks

05/27/2019
by   Guodong Zhang, et al.
35

Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence for natural gradient descent on nonlinear neural networks with the squared-error loss. We identify two conditions which guarantee the efficient convergence from random initializations: (1) the Jacobian matrix (of network's output for all training cases with respect to the parameters) is full row rank, and (2) the Jacobian matrix is stable for small perturbations around the initialization. For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout the training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/04/2021

Convergence of gradient descent for learning linear neural networks

We study the convergence properties of gradient descent for training dee...
research
06/12/2020

Implicit bias of gradient descent for mean squared error regression with wide neural networks

We investigate gradient descent training of wide neural networks and the...
research
05/17/2023

On the ISS Property of the Gradient Flow for Single Hidden-Layer Neural Networks with Linear Activations

Recent research in neural networks and machine learning suggests that us...
research
11/04/2019

Persistency of Excitation for Robustness of Neural Networks

When an online learning algorithm is used to estimate the unknown parame...
research
05/22/2023

Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond

Recent research shows that when Gradient Descent (GD) is applied to neur...
research
07/21/2018

On the Analysis of Trajectories of Gradient Descent in the Optimization of Deep Neural Networks

Theoretical analysis of the error landscape of deep neural networks has ...
research
10/03/2022

Limitations of neural network training due to numerical instability of backpropagation

We study the training of deep neural networks by gradient descent where ...

Please sign up or login with your details

Forgot password? Click here to reset