Inductive Biases and Variable Creation in Self-Attention Mechanisms

10/19/2021
by Benjamin L. Edelman, et al.

Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, focusing on rigorously establishing which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent sparse functions of the input sequence, with sample complexity scaling only logarithmically with the context length. Furthermore, we propose new experimental protocols to support this analysis and to guide the practice of training Transformers, built around the large body of work on provably learning sparse Boolean functions.
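The "sparse variable" intuition can be illustrated with a toy example: with suitably chosen bounded-norm weights, a single attention head puts nearly all of its weight on one relevant token, largely independent of how long the context grows. The NumPy sketch below is only an illustration under hand-picked weights, not the paper's actual construction or bounds.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention; each row of X is a token embedding."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores, axis=-1) @ V

# Toy setup: one-hot embeddings for a vocabulary of 4 tokens.
d = 4
rng = np.random.default_rng(0)

# Hand-picked, illustrative weights of fixed norm: every position asks
# the same "question", and only token 0's key responds to it, so the
# head computes a sparse function of the sequence (the value at token 0).
scale = 8.0
Wq = np.ones((d, d))                  # uniform query for all positions
Wk = np.zeros((d, d)); Wk[0] = scale  # only token 0 produces a large key
Wv = np.eye(d)                        # values pass embeddings through

for T in (8, 64, 512):                # growing context length
    seq = rng.integers(1, d, size=T)  # distractor tokens 1..3
    seq[T // 2] = 0                   # a single occurrence of token 0
    X = np.eye(d)[seq]
    out = self_attention(X, Wq, Wk, Wv)
    # Every output row is dominated by token 0's value vector [1, 0, 0, 0],
    # almost regardless of T.
    print(T, out[0].round(3))
```

The score gap between token 0 and the distractors is fixed by the weight norm, while the competing softmax mass grows only linearly in T; so to keep the selection sharp, the norm (and hence, roughly, the sample complexity) need only grow logarithmically with the context length, which is the flavor of the paper's main result.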


Related research:

- You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling (11/18/2021)
- Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding (09/13/2020)
- Transformer with Gaussian weighted self-attention for speech enhancement (10/13/2019)
- BP-Transformer: Modelling Long-Range Context via Binary Partitioning (11/11/2019)
- I-BERT: Inductive Generalization of Transformer to Arbitrary Context Lengths (06/18/2020)
- Why self-attention is Natural for Sequence-to-Sequence Problems? A Perspective from Symmetries (10/13/2022)
- Lipschitz Normalization for Self-Attention Layers with Application to Graph Neural Networks (03/08/2021)
