Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

11/13/2017
by Jeffrey Pennington, et al.

It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is O(1) is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near 1 is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks. Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
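As an illustration of the quantity the abstract refers to, the sketch below builds a deep tanh network (tanh being one sigmoidal choice), accumulates the input-output Jacobian layer by layer, and compares its singular value spectrum under orthogonal versus i.i.d. Gaussian weight initialization. The width, depth, weight scale sigma_w, and helper names are illustrative assumptions, not taken from the paper, which derives the spectrum analytically using free probability rather than by simulation.

```python
# Minimal sketch (not the authors' code): empirically inspect the singular value
# spectrum of a deep network's input-output Jacobian under two initializations.
import numpy as np

def haar_orthogonal(n, rng):
    """Draw a random n x n orthogonal matrix (Haar measure) via QR decomposition."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign-fix columns so the distribution is uniform

def jacobian_singular_values(depth, width, init="orthogonal", sigma_w=1.05, seed=0):
    """Push one input through a deep tanh network and accumulate the input-output
    Jacobian J = D_L W_L ... D_1 W_1, where D_l is the diagonal matrix of pointwise
    tanh derivatives at layer l; return the singular values of J."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)               # network input
    J = np.eye(width)
    for _ in range(depth):
        if init == "orthogonal":
            W = sigma_w * haar_orthogonal(width, rng)
        else:                                    # i.i.d. Gaussian, variance sigma_w^2 / width
            W = rng.standard_normal((width, width)) * sigma_w / np.sqrt(width)
        h = W @ x                                # pre-activations of this layer
        D = np.diag(1.0 - np.tanh(h) ** 2)       # phi'(h) for phi = tanh
        J = D @ W @ J                            # chain rule
        x = np.tanh(h)
    return np.linalg.svd(J, compute_uv=False)

if __name__ == "__main__":
    for init in ("orthogonal", "gaussian"):
        s = jacobian_singular_values(depth=64, width=256, init=init)
        print(f"{init:>10}: mean squared singular value = {np.mean(s**2):.3f}, "
              f"max = {s.max():.3f}, min = {s.min():.3e}")
```

Sweeping the depth and sigma_w in this sketch is a quick way to observe the depth dependence of the Jacobian spectrum that the paper characterizes analytically.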


Related research

02/27/2018 - The Emergence of Spectral Universality in Deep Networks
Recent work has shown that tight concentration of the entire spectrum of...

07/31/2018 - Spectrum concentration in deep residual learning: a free probability approach
We revisit the initialization of deep residual networks (ResNets) by int...

03/07/2022 - Singular Value Perturbation and Deep Network Optimization
We develop new theoretical results on matrix perturbation to shed light ...

10/09/2018 - Information Geometry of Orthogonal Initializations and Training
Recently mean field theory has been successfully used to analyze propert...

12/06/2018 - Singular Values for ReLU Layers
Despite their prevalence in neural networks we still lack a thorough the...

10/22/2020 - Deep Learning is Singular, and That's Good
In singular models, the optimal set of parameters forms an analytic set ...

05/15/2019 - Orthogonal Deep Neural Networks
In this paper, we introduce the algorithms of Orthogonal Deep Neural Net...
