Receptive Field Alignment Enables Transformer Length Extrapolation

12/20/2022
by Ta-Chung Chi, et al.

Length extrapolation is a desirable property that permits training a transformer language model on short sequences and retaining similar perplexities when the model is tested on substantially longer sequences. ALiBi, a relative positional embedding mechanism applied to the transformer self-attention matrix, demonstrates the length extrapolation property and is the most widely used such mechanism to date. In this paper, we show that ALiBi surprisingly does not utilize tokens beyond the training sequence length, which can be explained by its implicit windowed attention effect that aligns the receptive field between the training and testing stages. Inspired by ALiBi and the receptive field alignment hypothesis, we propose another transformer positional embedding design, named Sandwich, that uses information beyond the training sequence length; it is a greatly simplified formulation of the earliest proposed Sinusoidal positional embedding. Finally, we show that both ALiBi and Sandwich enable efficient inference thanks to their implicit windowed attention effect.
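For illustration of the implicit windowed attention effect discussed above, the following is a minimal PyTorch sketch of an ALiBi-style additive bias on the attention logits: each head penalizes distant tokens with a per-head slope, so far-away positions contribute little to the softmax. This is not the paper's Sandwich formulation, and all function names here are hypothetical.

```python
import torch

def alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    """Build an ALiBi-style relative-position bias added to attention logits.

    Head h gets a slope m_h, and pair (i, j) with j <= i receives a penalty
    -m_h * (i - j), which effectively narrows the attention window with distance.
    """
    # Geometric slopes as in the ALiBi paper (assuming a power-of-2 head count).
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0).float()  # i - j for j <= i
    return -slopes[:, None, None] * distance  # shape: (num_heads, seq_len, seq_len)

def attend_with_bias(q, k, v, bias):
    """Causal scaled dot-product attention with an additive position bias.

    q, k, v: (num_heads, seq_len, head_dim); bias: (num_heads, seq_len, seq_len).
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5) + bias
    causal = torch.triu(torch.ones(scores.shape[-2:], dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

if __name__ == "__main__":
    heads, seq, dim = 4, 16, 8
    q = k = v = torch.randn(heads, seq, dim)
    out = attend_with_bias(q, k, v, alibi_bias(seq, heads))
    print(out.shape)  # torch.Size([4, 16, 8])
```

Because the bias grows linearly with distance, tokens far beyond the slope-determined window receive negligible attention weight at test time, which is the behavior the paper identifies as receptive field alignment.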

