On the impact of activation and normalization in obtaining isometric embeddings at initialization

05/28/2023
by Amir Joudaki, et al.

In this paper, we explore the structure of the penultimate Gram matrix in deep neural networks, which contains the pairwise inner products of outputs corresponding to a batch of inputs. In several architectures, this Gram matrix has been observed to become degenerate with depth at initialization, which dramatically slows training. Normalization layers, such as batch or layer normalization, play a pivotal role in preventing this rank collapse. Despite promising advances, existing theoretical results (i) do not extend to layer normalization, which is widely used in transformers, and (ii) cannot quantitatively characterize the bias of normalization at finite depth. To bridge this gap, we prove that layer normalization, in conjunction with activation layers, biases the Gram matrix of a multilayer perceptron towards isometry at an exponential rate with depth at initialization. We quantify this rate using the Hermite expansion of the activation function, highlighting the importance of higher-order (≥ 2) Hermite coefficients in the bias towards isometry.
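To make the claim concrete, here is a minimal numerical sketch (not the authors' code): it initializes a random multilayer perceptron with tanh activations followed by layer normalization and tracks how far the batch Gram matrix is from isometry as depth grows. The width, depth, batch size, tanh activation, and the correlation-based gap metric are all illustrative assumptions, not values or definitions taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 512, 50, 4  # illustrative sizes, not from the paper

def layer_norm(x):
    # Normalize each row (one sample) to zero mean and unit variance across features.
    mu = x.mean(axis=1, keepdims=True)
    sigma = x.std(axis=1, keepdims=True)
    return (x - mu) / (sigma + 1e-8)

def isometry_gap(x):
    # Heuristic proxy for distance from isometry (not the paper's exact functional):
    # Frobenius distance between the embeddings' correlation matrix and the identity.
    g = x @ x.T / x.shape[1]
    d = np.sqrt(np.diag(g))
    corr = g / np.outer(d, d)
    return np.linalg.norm(corr - np.eye(batch))

# Start from a nearly rank-one batch (highly correlated inputs) so that the
# drift towards isometry with depth is visible.
base = rng.standard_normal(width)
x = base + 0.1 * rng.standard_normal((batch, width))

for layer in range(1, depth + 1):
    w = rng.standard_normal((width, width)) / np.sqrt(width)  # variance-preserving init
    x = layer_norm(np.tanh(x @ w))
    if layer % 10 == 0:
        print(f"depth {layer:3d}   isometry gap {isometry_gap(x):.4f}")
```

On this toy setup the printed gap should shrink roughly geometrically with depth, mirroring the exponential rate claimed in the abstract; swapping tanh for the identity map, which has no higher-order Hermite components, should leave the initial correlations essentially in place.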


