
Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks

by Santosh Vempala, et al.

We analyze Gradient Descent (GD) applied to learning a bounded target function on n real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD starting from a randomly initialized network converges in mean squared loss to the minimum error (in 2-norm) of the best approximation of the target function by a polynomial of degree at most k. Moreover, the size of the network and the number of iterations needed are both bounded by n^O(k). The core of our analysis is the following existence theorem, which is of independent interest: for any ϵ > 0, any bounded function that has a degree-k polynomial approximation with error ϵ_0 (in 2-norm) can be approximated to within error ϵ_0 + ϵ by a linear combination of n^O(k) · poly(1/ϵ) randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to k. In particular, this applies to training networks of unbiased sigmoids and ReLUs.
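The existence theorem can be illustrated with a minimal sketch: draw random unbiased ReLU gates, freeze them, and fit only the output-layer weights by gradient descent on the mean squared loss, so the learned network is exactly a linear combination of randomly chosen gates. The dimensions, target function, step size, and iteration count below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (dimensions and target are assumptions, not from the paper):
n, m, N = 5, 200, 1000               # input dimension, hidden width, samples
X = rng.standard_normal((N, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = np.sin(X[:, 0])                  # a bounded target function

# Hidden layer: m randomly chosen unbiased ReLU gates with frozen weights
W = rng.standard_normal((m, n))
W /= np.linalg.norm(W, axis=1, keepdims=True)
H = np.maximum(X @ W.T, 0.0)         # hidden activations, shape (N, m)

# Gradient descent on the output weights only (a convex least-squares problem),
# so the result is a linear combination of the random gates.
a = np.zeros(m)
lr = 0.1 / m
for _ in range(2000):
    resid = H @ a - y                # residual of current linear combination
    a -= lr * (H.T @ resid) / N      # gradient step on mean squared loss

mse = np.mean((H @ a - y) ** 2)      # strictly below the initial loss np.mean(y**2)
```

Freezing the hidden weights makes the objective a convex quadratic in the output weights, which is the regime the existence theorem speaks to; the paper's full analysis concerns GD on the randomly initialized network itself.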



