Width Provably Matters in Optimization for Deep Linear Neural Networks

01/24/2019
by Simon S. Du, et al.

We prove that for an L-layer fully-connected linear neural network, if the width of every hidden layer is Ω̃(L · r · d_out · κ^3), where r and κ are the rank and the condition number of the input data, and d_out is the output dimension, then gradient descent with Gaussian random initialization converges to a global minimum at a linear rate. The number of iterations to find an ϵ-suboptimal solution is O(κ log(1/ϵ)). Our polynomial upper bound on the total running time for wide deep linear networks, together with the exp(Ω(L)) lower bound for narrow deep linear networks [Shamir, 2018], demonstrates that wide layers are necessary for optimizing deep models.
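As a rough, self-contained illustration of the setup described in the abstract (not the paper's own construction or constants), the sketch below runs full-batch gradient descent on a deep fully-connected linear network with Gaussian random initialization in NumPy. The width, initialization scale, step size, and synthetic data are illustrative assumptions; in particular the width here is not chosen to satisfy the theorem's Ω̃(L · r · d_out · κ^3) requirement.

```python
import numpy as np

def init_deep_linear(d_in, d_out, width, L, rng):
    """Gaussian random initialization for an L-layer linear network.
    Entries are N(0, 1/fan_in); this scaling is an illustrative choice,
    not necessarily the exact scheme analyzed in the paper."""
    dims = [d_in] + [width] * (L - 1) + [d_out]
    return [rng.normal(0.0, 1.0 / np.sqrt(dims[i]), size=(dims[i + 1], dims[i]))
            for i in range(L)]

def gd_step(Ws, X, Y, lr):
    """One full-batch GD step on the loss 0.5 * ||W_L ... W_1 X - Y||_F^2."""
    # Forward pass, caching activations for the layer-wise gradients.
    acts = [X]
    for W in Ws:
        acts.append(W @ acts[-1])
    residual = acts[-1] - Y                    # shape: d_out x n
    # Gradient for layer i: (product of later layers)^T @ residual @ acts[i]^T
    grads, back = [], residual
    for i in reversed(range(len(Ws))):
        grads.append(back @ acts[i].T)
        back = Ws[i].T @ back
    grads.reverse()
    new_Ws = [W - lr * g for W, g in zip(Ws, grads)]
    return new_Ws, 0.5 * np.sum(residual ** 2)

# Toy run; all dimensions and the step size are illustrative assumptions.
rng = np.random.default_rng(0)
d_in, d_out, width, L, n = 20, 5, 256, 4, 100
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, d_in)) @ X          # realizable linear targets
Ws = init_deep_linear(d_in, d_out, width, L, rng)
lr = 1e-2 / width                               # small, conservative step size
for t in range(500):
    Ws, loss = gd_step(Ws, X, Y, lr)
    if t % 100 == 0:
        print(f"step {t:4d}  loss {loss:.4f}")
```

The printed loss trajectory is only meant to show the training dynamics at a Gaussian initialization; the theorem's linear rate and the O(κ log(1/ϵ)) iteration bound hold under its specific width and step-size conditions, which this toy run does not attempt to match.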
