Polynomial Convergence of Gradient Descent for Training One-Hidden-Layer Neural Networks
We analyze gradient descent (GD) applied to learning a bounded target function on n real-valued inputs by training a neural network with a single hidden layer of nonlinear gates. Our main finding is that GD, starting from a randomly initialized network, converges in mean squared loss to the error (in 2-norm) of the best approximation of the target function by a polynomial of degree at most k. Moreover, the size of the network and the number of iterations needed are both bounded by n^O(k). The core of our analysis is the following existence theorem, which is of independent interest: for any ϵ > 0, any bounded function that has a degree-k polynomial approximation with error ϵ_0 (in 2-norm) can be approximated to within error ϵ_0 + ϵ by a linear combination of n^O(k) · poly(1/ϵ) randomly chosen gates from any class of gates whose corresponding activation function has nonzero coefficients in its harmonic expansion for degrees up to k. In particular, this applies to training networks of unbiased sigmoids and ReLUs.
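As a concrete illustration of this setting (a minimal sketch, not the paper's algorithm or constants), the code below fits a hypothetical bounded target by running gradient descent on the mean squared loss over the output weights of a one-hidden-layer ReLU network whose hidden gates are randomly initialized and kept fixed, i.e., a learned linear combination of randomly chosen gates. The target function, network width, step size, and iteration count are all illustrative assumptions.

```python
# Minimal sketch: gradient descent on the output weights of a one-hidden-layer
# ReLU network with random, frozen hidden gates. All numerical choices below
# (target, width, step size, iterations) are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)

n, width, samples = 5, 1000, 4000           # input dimension, hidden gates, sample size
X = rng.standard_normal((samples, n))       # real-valued inputs
y = np.sin(X[:, 0]) * X[:, 1]               # hypothetical bounded target function

W = rng.standard_normal((n, width)) / np.sqrt(n)   # random unbiased hidden weights (frozen)
H = np.maximum(X @ W, 0.0)                  # ReLU gate outputs, one column per gate

a = np.zeros(width)                         # output-layer weights, trained by GD
lr = 1e-3
for _ in range(3000):
    residual = H @ a - y                    # prediction error on the sample
    grad = H.T @ residual / samples         # gradient of the mean squared loss
    a -= lr * grad

mse = np.mean((H @ a - y) ** 2)
print(f"final mean squared loss: {mse:.4f}")
```

In this sketch only the linear combination over the random gates is learned, mirroring the existence theorem's guarantee that such a combination can already match the best degree-k polynomial approximation up to an additive ϵ.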