Parameter Norm Growth During Training of Transformers

10/19/2020
by William Merrill, et al.

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically some variant of gradient descent (GD). To better understand this bias, we study the tendency of transformer parameters to grow in magnitude during training. We find, both theoretically and empirically, that in certain contexts GD increases the parameter L_2 norm up to a threshold that itself increases with training-set accuracy; thus, as training accuracy improves over time, the norm is able to keep growing. Empirically, we show that the norm grows continuously over pretraining for T5 (Raffel et al., 2019). We show that pretrained T5 approximates a semi-discretized network with saturated activation functions. Such "saturated" networks are known to have a reduced capacity compared to the original network family, and this reduced capacity can be described in automata-theoretic terms. This suggests saturation is a new characterization of an inductive bias implicit in GD that is of particular interest for NLP. While our experiments focus on transformers, our theoretical analysis extends to other architectures with similar formal properties, such as feedforward ReLU networks.
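
To make the norm-growth claim concrete, here is a minimal sketch (not taken from the paper's experiments): a small bias-free feedforward ReLU network is trained with full-batch gradient descent on a linearly separable toy task while we log training accuracy and the L_2 norm of the concatenated parameter vector. Every concrete choice here (dataset, hidden size, learning rate, step count) is an illustrative assumption; under the abstract's claim, we would expect the norm to keep increasing once the training data are fit.

```python
# Minimal sketch (not the paper's code): track parameter L_2 norm growth while
# training a small bias-free feedforward ReLU network with full-batch gradient
# descent on a linearly separable toy task. The architecture, data, and
# hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy separable data: the label is determined by the sign of the first feature.
X = torch.randn(256, 10)
y = (X[:, 0] > 0).float()

# Bias-free layers keep the network homogeneous, matching the setting in which
# norm growth under GD is typically analyzed.
model = nn.Sequential(
    nn.Linear(10, 32, bias=False),
    nn.ReLU(),
    nn.Linear(32, 1, bias=False),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # full batch, so plain GD
loss_fn = nn.BCEWithLogitsLoss()

def param_norm(m: nn.Module) -> float:
    """L_2 norm of all parameters, viewed as one long vector."""
    return torch.cat([p.detach().flatten() for p in m.parameters()]).norm().item()

for step in range(5001):
    opt.zero_grad()
    logits = model(X).squeeze(-1)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        acc = ((logits > 0).float() == y).float().mean().item()
        print(f"step {step:5d}  train acc {acc:.3f}  ||theta||_2 {param_norm(model):7.2f}")
```

The "saturated" network referred to above can be understood as the limit of the network's function as its parameters are scaled up, i.e., f(x; c*theta) as c goes to infinity; for transformers, this limit pushes softmax attention toward a hard argmax over the maximum-scoring positions, which is what makes the saturated family amenable to automata-theoretic analysis.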


Related research

02/11/2022 · Support Vectors and Gradient Dynamics for Implicit Bias in ReLU Networks
Understanding implicit bias of gradient descent has been an important go...

06/10/2022 · Intrinsic dimensionality and generalization properties of the ℛ-norm inductive bias
We study the structural and statistical properties of ℛ-norm minimizing ...

02/05/2022 · The Implicit Bias of Gradient Descent on Generalized Gated Linear Networks
Understanding the asymptotic behavior of gradient-descent training of de...

02/27/2015 · Norm-Based Capacity Control in Neural Networks
We investigate the capacity, convexity and characterization of a general...

12/16/2021 · Trees in transformers: a theoretical analysis of the Transformer's ability to represent trees
Transformer networks are the de facto standard architecture in natural l...

07/21/2022 · Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling?
There has been a lot of interest in the scaling properties of Transform...

03/02/2023 · Penalising the biases in norm regularisation enforces sparsity
Controlling the parameters' norm often yields good generalisation when t...
