Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks

05/07/2018
by Santosh Vempala et al.

We analyze gradient descent (GD) applied to learning a bounded target function on n real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD, starting from a randomly initialized network, converges in mean squared loss to the error (in 2-norm) of the best approximation of the target function by a polynomial of degree at most k. Moreover, the size of the network and the number of iterations needed are both bounded by n^O(k). The core of our analysis is the following existence theorem, which is of independent interest: for any ϵ > 0, any bounded function that has a degree-k polynomial approximation with error ϵ_0 (in 2-norm) can be approximated to within error ϵ_0 + ϵ by a linear combination of n^O(k) · poly(1/ϵ) randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to k. In particular, this applies to training networks of unbiased sigmoids and ReLUs.
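The existence theorem suggests a simple random-features picture: draw the hidden gates at random, freeze them, and fit only the output weights by gradient descent on the squared loss. The sketch below is not the paper's construction; the width, step size, Gaussian inputs, ReLU gates, and quadratic target are all illustrative assumptions. It shows a bounded low-degree polynomial target being approximated by a linear combination of randomly chosen unbiased ReLU gates trained with plain GD on the output layer.

```python
# Minimal sketch (assumptions: width m, step size eta, Gaussian inputs,
# unbiased ReLU gates, degree-2 target are illustrative, not the paper's setup).
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 1000                       # input dimension, number of random gates

# Random unbiased gates: hidden weights drawn once from the sphere and kept fixed.
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)

def features(X):
    """ReLU activations of the frozen random gates (no biases)."""
    return np.maximum(X @ W.T, 0.0)

# Bounded-degree polynomial target along a random direction (illustrative choice).
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
def target(X):
    z = X @ u
    return z ** 2 - 1.0               # degree-2 Hermite polynomial of the projection

# Gaussian inputs; gradient descent on the output weights only.
X = rng.standard_normal((4000, n))
y = target(X)
Phi = features(X)

a = np.zeros(m)                       # output weights: the linear combination of gates
eta = 1.0 / m                         # conservative step size for this sketch
for _ in range(2000):
    residual = Phi @ a - y
    a -= eta * (Phi.T @ residual) / len(y)

print(f"train MSE after GD on the output layer: {np.mean((Phi @ a - y) ** 2):.4f}")
```

Note that only the output layer is trained in this sketch, so it illustrates the existence theorem's linear-combination-of-random-gates approximation rather than the paper's convergence analysis of GD on the full network.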


