Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

08/27/2021
by   Ofir Press, et al.

Since the introduction of the transformer model by Vaswani et al. (2017), a fundamental question has remained open: how can a model extrapolate at inference time to sequences longer than those seen during training? We first show that extrapolation can be improved by changing the position representation method, though we find that existing proposals do not allow efficient extrapolation. We introduce a simple and efficient method, Attention with Linear Biases (ALiBi), that allows for extrapolation. ALiBi does not add positional embeddings to the word embeddings; instead, it biases the query-key attention scores with a term that is proportional to their distance. We show that this method allows training a 1.3 billion parameter model on input sequences of length 1024 that extrapolates to input sequences of length 2048, achieving the same perplexity as a sinusoidal position embedding model trained on inputs of length 2048, while training 11% faster and using 11% less memory. ALiBi's inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark. Finally, we provide analysis of ALiBi to understand why it leads to better performance.
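
The biasing step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single causal attention head, plain Python lists instead of tensors, and a hypothetical slope value of 0.5 chosen only for illustration (the abstract does not state how the proportionality constant is set). The helper names `alibi_bias` and `biased_attention_weights` are made up for this example.

```python
import math

def alibi_bias(seq_len, slope):
    """Build a seq_len x seq_len bias matrix (illustrative helper).

    Entry [i][j] is -slope * (i - j) when key j is at or before query i,
    so the penalty grows linearly with distance. Future keys (j > i) get
    -inf, which masks them out under causal attention.
    """
    return [
        [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(scores, slope):
    """Add the linear distance bias to raw query-key scores, then softmax.

    No positional embedding is involved: position enters only via the bias.
    """
    n = len(scores)
    bias = alibi_bias(n, slope)
    return [
        softmax([s + b for s, b in zip(scores[i], bias[i])])
        for i in range(n)
    ]

# Uniform raw scores isolate the effect of the bias on the weights.
# slope=0.5 is an arbitrary assumption for this demo.
weights = biased_attention_weights([[0.0] * 4 for _ in range(4)], slope=0.5)
for row in weights:
    print([round(w, 3) for w in row])
```

With identical raw scores, each query's weights decay with distance from the query position, which is the bias towards recent tokens that the abstract refers to. Because the bias depends only on the offset between query and key, the same function can be called with a `seq_len` larger than any length used in training.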


