Global Convergence of Gradient Descent for Deep Linear Residual Networks

by Lei Wu et al.

We analyze the global convergence of gradient descent for deep linear residual networks by proposing a new initialization, zero-asymmetric (ZAS) initialization, which is motivated by avoiding the stable manifolds of saddle points. We prove that under ZAS initialization, for an arbitrary target matrix, gradient descent converges to an ε-optimal point in O(L^3 log(1/ε)) iterations, which scales polynomially with the network depth L. Together with the exp(Ω(L)) convergence time for the standard initializations (Xavier or near-identity) [Shamir, 2018], our result demonstrates the importance of both the residual structure and the initialization in optimizing deep linear neural networks, especially when L is large.
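The setting described above can be sketched numerically. The following is a minimal illustration, not the authors' exact formulation: a deep linear residual network f(x) = (I + W_L)···(I + W_1)x, with every residual block W_l initialized to zero (so training starts at the identity map), fit by plain gradient descent to an arbitrary target matrix Φ. The sizes, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, n = 4, 8, 64                 # width, depth, sample count (illustrative)
Phi = rng.standard_normal((d, d))  # arbitrary target matrix
X = rng.standard_normal((d, n))
Y = Phi @ X                        # labels generated by the target map

Ws = [np.zeros((d, d)) for _ in range(L)]  # zero-initialized residual blocks

def forward(Ws, X):
    """Apply the residual blocks h <- h + W_l h, keeping activations."""
    h, acts = X, [X]
    for W in Ws:
        h = h + W @ h
        acts.append(h)
    return h, acts

def loss(Ws):
    out, _ = forward(Ws, X)
    return 0.5 * np.mean(np.sum((out - Y) ** 2, axis=0))

init_loss = loss(Ws)
lr = 1e-2 / L                      # step size shrinking with depth
for _ in range(2000):
    out, acts = forward(Ws, X)
    g = (out - Y) / n              # dLoss / d(output)
    grads = [None] * L
    for l in reversed(range(L)):   # backprop through h + W_l h
        grads[l] = g @ acts[l].T   # dLoss / dW_l
        g = g + Ws[l].T @ g        # (I + W_l)^T g
    for l in range(L):
        Ws[l] -= lr * grads[l]

print(init_loss, loss(Ws))         # training loss before and after
```

Starting at the identity map means the initial loss is finite for any depth, and each block receives a distinct gradient once the layers begin to differ, which is the intuition behind breaking symmetry at initialization.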




Convergence of gradient descent for learning linear neural networks

We study the convergence properties of gradient descent for training dee...

Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks

In this note, we study the dynamics of gradient descent on objective fun...

Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

We analyze algorithms for approximating a function f(x) = Φ x mapping ℝ^d...

Directional Convergence Analysis under Spherically Symmetric Distribution

We consider the fundamental problem of learning linear predictors (i.e.,...

Weighted Residuals for Very Deep Networks

Deep residual networks have recently shown appealing performance on many...

Why does CTC result in peaky behavior?

The peaky behavior of CTC models is well known experimentally. However, ...

Stronger Convergence Results for Deep Residual Networks: Network Width Scales Linearly with Training Data Size

Deep neural networks are highly expressive machine learning models with ...