The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit

06/30/2023
by Lorenzo Noci, et al.

In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with skip connections in the proportional limit of infinite-depth-and-width. We show that at initialization the limiting distribution can be described by a stochastic differential equation (SDE) indexed by the depth-to-width ratio. To achieve a well-defined stochastic limit, the Transformer's attention mechanism is modified by centering the Softmax output at identity, and scaling the Softmax logits by a width-dependent temperature parameter. We examine the stability of the network through the corresponding SDE, showing how the scale of both the drift and diffusion can be elegantly controlled with the aid of residual connections. The existence of a stable SDE implies that the covariance structure is well-behaved, even for very large depth and width, thus preventing the notorious issues of rank degeneracy in deep attention models. Finally, we show, through simulations, that the SDE provides a surprisingly good description of the corresponding finite-size model. We coin the name shaped Transformer for these architectural modifications.
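As a concrete illustration of the two modifications described above, centering the Softmax output at the identity and scaling the logits by a width-dependent temperature, the following is a minimal NumPy sketch of one shaped-attention matrix at initialization. The function name, the specific 1/sqrt(d) temperature, and the exact centering term are illustrative assumptions rather than the paper's definitions.

```python
import numpy as np

def shaped_attention_matrix(X, Wq, Wk, tau=None):
    """Sketch of a 'shaped' attention matrix at initialization.

    The Softmax logits are scaled by an (assumed) width-dependent temperature,
    and the Softmax output is re-centered so that the whole matrix sits at the
    identity as the temperature goes to zero.
    """
    n, d = X.shape                                  # n tokens, width d
    if tau is None:
        tau = 1.0 / np.sqrt(d)                      # assumed width-dependent temperature
    logits = tau * (X @ Wq) @ (X @ Wk).T            # scaled attention logits, shape (n, n)
    logits -= logits.max(axis=1, keepdims=True)     # numerically stable Softmax
    softmax = np.exp(logits)
    softmax /= softmax.sum(axis=1, keepdims=True)
    # Center the Softmax at the identity: subtract the uniform matrix (1/n) 11^T
    # and add I, so for small tau the attention acts as a small perturbation of
    # a skip connection rather than as a repeated averaging operator.
    return np.eye(n) + softmax - np.ones((n, n)) / n
```

In this form, as tau tends to zero the Softmax tends to the uniform matrix and the attention matrix tends to the identity. This is the mechanism the abstract points to for avoiding rank degeneracy: token representations are not repeatedly averaged together, so their covariance structure can remain non-degenerate even at large depth and width.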

Related research:

06/06/2022 · The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization
The logit outputs of a feedforward neural network at initialization are ...

02/01/2023 · Width and Depth Limits Commute in Residual Networks
We show that taking the width and depth to infinity in a deep neural net...

10/03/2022 · On the infinite-depth limit of finite-width neural networks
In this paper, we study the infinite-depth limit of finite-width residua...

06/07/2021 · The Future is Log-Gaussian: ResNets and Their Infinite-Depth-and-Width Limit at Initialization
Theoretical results show that neural networks can be approximated by Gau...

06/11/2020 · Dynamically Stable Infinite-Width Limits of Neural Classifiers
Recent research has been focused on two different approaches to studying...

05/09/2021 · Which transformer architecture fits my data? A vocabulary bottleneck in self-attention
After their successful debut in natural language processing, Transformer...

03/30/2023 · Neural signature kernels as infinite-width-depth-limits of controlled ResNets
Motivated by the paradigm of reservoir computing, we consider randomly i...
