
Stochastic Gradient Descent Optimizes Overparameterized Deep ReLU Networks
We study the problem of training deep neural networks with Rectified Lin...
11/21/2018 ∙ by Difan Zou, et al.

A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks
We analyze speed of convergence to global optimum for gradient descent t...
10/04/2018 ∙ by Sanjeev Arora, et al.

On Symmetry and Initialization for Neural Networks
This work provides an additional step in the theoretical understanding o...
07/01/2019 ∙ by Ido Nachum, et al.

Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks
In this note, we study the dynamics of gradient descent on objective fun...
09/23/2018 ∙ by Ohad Shamir, et al.

Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks
We analyze Gradient Descent applied to learning a bounded target functio...
05/07/2018 ∙ by Santosh Vempala, et al.

Gradient Descent Finds Global Minima for Generalizable Deep Neural Networks of Practical Sizes
In this paper, we theoretically prove that gradient descent can find a g...
08/05/2019 ∙ by Kenji Kawaguchi, et al.

The Global Optimization Geometry of Shallow Linear Neural Networks
We examine the squared error loss landscape of shallow linear neural net...
05/13/2018 ∙ by Zhihui Zhu, et al.

Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects of pathological curvature in neural network optimization, but little is known theoretically about its convergence properties, especially for nonlinear networks. In this work, we analyze for the first time the speed of convergence of natural gradient descent on nonlinear neural networks with the squared-error loss. We identify two conditions which guarantee efficient convergence from random initializations: (1) the Jacobian matrix (of the network's output for all training cases with respect to the parameters) has full row rank, and (2) the Jacobian matrix is stable under small perturbations around the initialization. For two-layer ReLU neural networks (i.e., with one hidden layer), we prove that these two conditions do in fact hold throughout training, under the assumptions of nondegenerate inputs and overparameterization. We further extend our analysis to more general loss functions. Lastly, we show that K-FAC, an approximate natural gradient descent method, also converges to global minima under the same assumptions.
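
The setting described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code, and it rests on several assumptions not stated above: only the hidden-layer weights are trained, the output weights are fixed random signs, the network output is scaled by 1/sqrt(m), inputs are normalized to unit length, and the natural gradient step for the squared-error loss is written in its overparameterized form, W ← W − lr · Jᵀ(JJᵀ)⁻¹(f(X) − y), with a tiny damping term for numerical safety. The names `forward`, `jacobian`, `ngd_step`, and `run_demo` are hypothetical helpers introduced here.

```python
import numpy as np


def forward(W, a, X):
    """Two-layer ReLU net: f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x)."""
    m = W.shape[0]
    return np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)


def jacobian(W, a, X):
    """Jacobian of the n outputs w.r.t. the flattened hidden weights, shape (n, m*d)."""
    n, d = X.shape
    m = W.shape[0]
    active = (X @ W.T > 0).astype(X.dtype)        # (n, m) ReLU activation pattern
    J = (active * a)[:, :, None] * X[:, None, :]  # (n, m, d)
    return J.reshape(n, m * d) / np.sqrt(m)


def ngd_step(W, a, X, y, lr=0.5, damping=1e-8):
    """One natural gradient step (assumed form): W <- W - lr * J^T (J J^T)^{-1} (f - y)."""
    J = jacobian(W, a, X)
    G = J @ J.T                                   # n x n Gram matrix of the Jacobian
    residual = forward(W, a, X) - y
    coeffs = np.linalg.solve(G + damping * np.eye(len(y)), residual)
    W_new = W - lr * (J.T @ coeffs).reshape(W.shape)
    # The smallest eigenvalue of G is positive exactly when J has full row rank.
    return W_new, np.linalg.eigvalsh(G)[0]


def run_demo(n=20, d=10, m=4000, steps=40, seed=0):
    """Train on random unit-norm inputs; return the loss history and min eigenvalue seen."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # nondegenerate, unit-norm inputs
    y = rng.standard_normal(n)
    W = rng.standard_normal((m, d))                # random initialization
    a = rng.choice([-1.0, 1.0], size=m)            # fixed output weights (assumption)
    losses, min_eigs = [], []
    for _ in range(steps):
        losses.append(0.5 * np.sum((forward(W, a, X) - y) ** 2))
        W, lam_min = ngd_step(W, a, X, y)
        min_eigs.append(lam_min)
    losses.append(0.5 * np.sum((forward(W, a, X) - y) ** 2))
    return losses, min(min_eigs)
```

Calling `losses, lam_min = run_demo()` returns the squared-error loss before each step (plus the final loss) and the smallest Gram-matrix eigenvalue observed during training. In this sketch the network is heavily overparameterized (m·d = 40000 parameters for n = 20 training cases), which is the regime in which the abstract's full-row-rank condition is expected to hold.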