A Convergence Theory for Deep Learning via Over-Parameterization

by Zeyuan Allen-Zhu, et al.

Deep neural networks (DNNs) have demonstrated dominating performance in many fields, e.g., computer vision, natural language processing, and robotics. Since AlexNet, the neural networks used in practice have grown wider and deeper. On the theoretical side, a long line of work has focused on why we can train neural networks when there is only one hidden layer. The theory of multi-layer neural networks remains somewhat unsettled. We present a new theory to understand the convergence of training DNNs. We only make two assumptions: the inputs do not degenerate and the network is over-parameterized. The latter means the number of hidden neurons is sufficiently large: polynomial in n, the number of training samples, and in L, the number of layers. We show that on the training dataset, starting from randomly initialized weights, simple algorithms such as stochastic gradient descent attain 100% accuracy in classification tasks, or minimize ℓ_2 regression loss at a linear convergence rate, with a number of iterations that scales only polynomially in n and L. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss function. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
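
The kind of behaviour described above can be illustrated with a small numerical sketch. It is not the paper's setting or proof: it assumes a single hidden ReLU layer instead of L layers, full-batch gradient descent instead of stochastic gradient descent, fixed random output weights, and synthetic unit-norm inputs with random targets. The function name `train_overparameterized_relu` and all of its parameters are hypothetical, chosen only for this illustration.

```python
import numpy as np

def train_overparameterized_relu(n=8, d=5, m=4096, steps=500, lr=0.2, seed=0):
    """Full-batch gradient descent on a wide one-hidden-layer ReLU network.

    Illustrative assumptions (not taken from the paper): one hidden layer,
    fixed +/-1 output weights, 1/sqrt(m) output scaling, synthetic data.
    Returns the l2 training loss recorded before each update.
    """
    rng = np.random.default_rng(seed)

    # n random unit-norm inputs in R^d (non-degenerate with probability 1)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)               # arbitrary regression targets

    W = rng.standard_normal((m, d))          # hidden weights, randomly initialized
    a = rng.choice([-1.0, 1.0], size=m)      # output weights, kept fixed

    losses = []
    for _ in range(steps):
        H = X @ W.T                                   # (n, m) pre-activations
        pred = (np.maximum(H, 0.0) @ a) / np.sqrt(m)  # network outputs
        r = pred - y                                  # residuals
        losses.append(0.5 * float(np.sum(r ** 2)))    # l2 regression loss

        # dLoss/dW[j] = (1/sqrt(m)) * sum_i r[i] * a[j] * 1[H[i, j] > 0] * X[i]
        G = ((r[:, None] * (H > 0)) * a[None, :]).T @ X / np.sqrt(m)
        W -= lr * G
    return losses
```

Calling `train_overparameterized_relu()` returns the list of training losses; plotting it on a logarithmic axis is one way to check whether the decrease looks roughly geometric for a given width m, which is the qualitative picture a linear convergence rate suggests.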



On the Convergence Rate of Training Recurrent Neural Networks

Despite the huge success of deep learning, our understanding to how the ...

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

One of the mystery in the success of neural networks is randomly initial...

Constrained Empirical Risk Minimization: Theory and Practice

Deep Neural Networks (DNNs) are widely used for their ability to effecti...

Perceptron Theory for Predicting the Accuracy of Neural Networks

Many neural network models have been successful at classification proble...

Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-Layer Networks

We develop exact representations of two layer neural networks with recti...

Improved Learning of One-hidden-layer Convolutional Neural Networks with Overlaps

We propose a new algorithm to learn a one-hidden-layer convolutional neu...

Training Fully Connected Neural Networks is ∃ℝ-Complete

We consider the algorithmic problem of finding the optimal weights and b...
