Effects of Depth, Width, and Initialization: A Convergence Analysis of Layer-wise Training for Deep Linear Neural Networks

10/14/2019
by Yeonjong Shin, et al.

Deep neural networks have been used in a wide range of machine learning applications and have achieved tremendous empirical success. Training them, however, remains challenging, and many alternatives to end-to-end back-propagation have been proposed. Layer-wise training is one such alternative: it trains a single layer at a time rather than all layers simultaneously. In this paper, we study layer-wise training of deep linear networks using block coordinate gradient descent (BCGD). We establish a general convergence analysis of BCGD and identify the optimal learning rate, which yields the fastest decrease in the loss. Importantly, this optimal learning rate can be applied directly in practice, since it requires no prior knowledge; tuning the learning rate is therefore unnecessary. We also identify the effects of depth, width, and initialization on the training process. We show that, when an orthogonal-like initialization is employed, the width of the intermediate layers plays no role in gradient-based training, as long as it is at least as large as both the input and output dimensions. We further show that, under certain conditions, the deeper the network, the faster convergence is guaranteed; in an extreme case, the global optimum is reached after updating each weight matrix only once. Moreover, we find that deep networks can drastically accelerate convergence compared to a depth-1 network, even when the additional computational cost is taken into account. Numerical examples are provided to support our theoretical findings and to demonstrate the performance of layer-wise training by BCGD.
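To make the training scheme concrete, the following is a minimal NumPy sketch of layer-wise training of a deep linear network by block coordinate gradient descent: starting from an orthogonal-like initialization, each sweep updates one weight matrix at a time by a gradient step while all other layers are held fixed. It illustrates the general setup described above rather than the paper's exact algorithm; the helper names (orthogonal_init, bcgd_sweep), the toy data, and the learning rate lr are illustrative assumptions, and lr in particular is a hand-picked placeholder rather than the optimal, tuning-free learning rate derived in the paper.

import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(out_dim, in_dim):
    # Orthogonal-like initialization: the weight matrix has orthonormal
    # columns (out_dim >= in_dim) or orthonormal rows (out_dim < in_dim).
    a = rng.standard_normal((max(out_dim, in_dim), min(out_dim, in_dim)))
    q, _ = np.linalg.qr(a)  # q has orthonormal columns
    return q if out_dim >= in_dim else q.T

def loss(weights, X, Y):
    # Squared loss of the end-to-end linear map W_L ... W_1 applied to X.
    pred = X
    for W in weights:
        pred = W @ pred
    return 0.5 * np.linalg.norm(pred - Y) ** 2 / X.shape[1]

def bcgd_sweep(weights, X, Y, lr):
    # One block coordinate sweep: update each weight matrix in turn by a
    # single gradient step, holding all other layers fixed.
    n = X.shape[1]
    for j in range(len(weights)):
        below = X
        for W in weights[:j]:         # product of the layers below j, applied to X
            below = W @ below
        above = np.eye(weights[j].shape[0])
        for W in weights[j + 1:]:     # product of the layers above j
            above = W @ above
        residual = above @ weights[j] @ below - Y
        grad = above.T @ residual @ below.T / n
        weights[j] -= lr * grad

# Toy problem: fit a random linear map R^4 -> R^3 with a depth-3 network of width 8,
# so the intermediate width exceeds both the input and output dimensions.
d_in, d_out, width, depth, n = 4, 3, 8, 3, 200
X = rng.standard_normal((d_in, n))
Y = rng.standard_normal((d_out, d_in)) @ X

dims = [d_in] + [width] * (depth - 1) + [d_out]
weights = [orthogonal_init(dims[k + 1], dims[k]) for k in range(depth)]

lr = 0.1  # hand-picked placeholder, not the optimal learning rate from the paper
for sweep in range(200):
    bcgd_sweep(weights, X, Y, lr)
print("loss after 200 sweeps:", loss(weights, X, Y))

Varying depth and width in this setup, while keeping d_in and d_out fixed, gives a simple way to probe the depth and width effects discussed above.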


