Batch Normalization Orthogonalizes Representations in Deep Random Networks

06/07/2021
by Hadi Daneshmand et al.

This paper highlights a subtle property of batch normalization (BN): successive batch normalizations interleaved with random linear transformations make hidden representations increasingly orthogonal across the layers of a deep neural network. We establish a non-asymptotic characterization of the interplay between depth, width, and the orthogonality of deep representations. More precisely, under a mild assumption, we prove that the deviation of the representations from orthogonality decays rapidly with depth, up to a term inversely proportional to the network width. This result has two main implications: 1) Theoretically, as the depth grows, the distribution of the representations after the linear layers contracts to a Wasserstein-2 ball around an isotropic Gaussian distribution, and the radius of this ball shrinks with the network width. 2) In practice, the orthogonality of the representations directly influences the performance of stochastic gradient descent (SGD). When the representations are initially aligned, we observe that SGD wastes many iterations orthogonalizing them before classification can begin. We show experimentally that starting optimization from orthogonal representations is sufficient to accelerate SGD, with no need for BN.
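A minimal NumPy sketch of the phenomenon the abstract describes (not the authors' code): alternate random Gaussian linear layers with per-feature batch normalization and track how far the batch Gram matrix is from a scaled identity. The gap metric, width, batch size, and depth below are illustrative assumptions, not the paper's exact experimental setup.

```python
# Sketch: successive (random linear -> batch norm) layers drive a batch of
# initially aligned inputs toward mutually orthogonal representations.
import numpy as np

rng = np.random.default_rng(0)

def batch_norm(H):
    """Normalize each feature (row) to zero mean, unit variance over the batch."""
    H = H - H.mean(axis=1, keepdims=True)
    return H / (H.std(axis=1, keepdims=True) + 1e-8)

def orthogonality_gap(H):
    """Scale-invariant Frobenius distance between the batch Gram matrix
    and the identity; one illustrative way to quantify orthogonality."""
    n = H.shape[1]
    G = H.T @ H
    G = G / np.linalg.norm(G)            # unit Frobenius norm
    return np.linalg.norm(G - np.eye(n) / np.sqrt(n))

width, batch, depth = 512, 16, 50
# Start from a nearly aligned batch: the hardest case for orthogonality.
base = rng.standard_normal((width, 1))
H = base + 0.01 * rng.standard_normal((width, batch))

for layer in range(1, depth + 1):
    W = rng.standard_normal((width, width)) / np.sqrt(width)  # random linear map
    H = batch_norm(W @ H)
    if layer % 10 == 0:
        print(f"layer {layer:3d}  orthogonality gap = {orthogonality_gap(H):.4f}")
```

With the width well above the batch size, the printed gap should fall quickly over the first layers and then plateau at a level that shrinks as the width grows, mirroring the depth-width trade-off stated in the abstract.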


