Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

06/23/2021
by Shengjie Luo, et al.

The attention module, which is a crucial component of the Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since relative positional encoding is used by default in many state-of-the-art models, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of kernelized attention. Based on the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT). With FFT, our method achieves 𝒪(n log n) time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than the standard Transformer in the long-sequence regime.
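The core algorithmic ingredient referred to in the abstract is multiplying an n x n Toeplitz matrix (the relative-position bias B with B[i, j] = b_{i-j}) by a vector in 𝒪(n log n) time via a circulant embedding and the FFT. The following is a minimal NumPy sketch of that FFT trick only, not the paper's full kernelized-attention algorithm; the function name toeplitz_matvec_fft and the indexing convention for b are assumptions made for illustration.

    import numpy as np

    def toeplitz_matvec_fft(b, x):
        # Multiply the n x n Toeplitz matrix T with T[i, j] = b_{i-j} by a vector x
        # in O(n log n) time.  b has length 2n - 1 and stores the distinct entries
        # b_{-(n-1)}, ..., b_0, ..., b_{n-1}, so that b[k] = b_{k-(n-1)}.
        n = x.shape[0]
        # Embed T into a (2n x 2n) circulant matrix whose first column is
        # b_0, b_1, ..., b_{n-1}, 0, b_{-(n-1)}, ..., b_{-1}.
        col = np.concatenate([b[n - 1:], [0.0], b[:n - 1]])
        # A circulant matrix-vector product is a circular convolution, which the
        # FFT diagonalizes; zero-pad x to length 2n and keep the first n outputs.
        y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
        return y[:n].real

    # Sanity check against the dense Toeplitz product on a toy size.
    n = 8
    rng = np.random.default_rng(0)
    b = rng.standard_normal(2 * n - 1)
    x = rng.standard_normal(n)
    T = np.array([[b[i - j + n - 1] for j in range(n)] for i in range(n)])
    assert np.allclose(T @ x, toeplitz_matvec_fft(b, x))

In the paper's setting, this Toeplitz multiplication would be applied to the kernelized (linear-attention) factors rather than to a dense attention matrix, which is how the overall 𝒪(n log n) complexity quoted in the abstract arises.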


Related research

02/23/2022
FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks
Transformers achieve remarkable performance in various domains, includin...

02/03/2023
Learning a Fourier Transform for Linear Relative Positional Encodings in Transformers
We propose a new class of linear Transformers called FourierLearner-Tran...

05/18/2021
Relative Positional Encoding for Transformers with Linear Complexity
Recent advances in Transformer models allow for unprecedented sequence l...

07/18/2023
Linearized Relative Positional Encoding
Relative positional encoding is widely used in vanilla and linear transf...

05/26/2022
Your Transformer May Not be as Powerful as You Expect
Relative Positional Encoding (RPE), which encodes the relative distance ...

11/08/2020
Long Range Arena: A Benchmark for Efficient Transformers
Transformers do not scale very well to long sequence lengths largely bec...

02/05/2023
KDEformer: Accelerating Transformers via Kernel Density Estimation
Dot-product attention mechanism plays a crucial role in modern deep arch...
