# Transformer with Fourier Integral Attentions

Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable results in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the queries and keys, which results from the use of unnormalized Gaussian kernels under the assumption that the queries follow a mixture of Gaussian distributions. There is no guarantee that this assumption is valid in practice. In response, we first interpret attention in transformers as a nonparametric kernel regression. We then propose the FourierFormer, a new class of transformers in which the dot-product kernels are replaced by the novel generalized Fourier integral kernels. Different from the dot-product kernels, where we need to choose a good covariance matrix to capture the dependency of the features of data, the generalized Fourier integral kernels can automatically capture such dependency and remove the need to tune the covariance matrix. We theoretically prove that our proposed Fourier integral kernels can efficiently approximate any key and query distributions. Compared to the conventional transformers with dot-product attention, FourierFormers attain better accuracy and reduce the redundancy between attention heads. We empirically corroborate the advantages of FourierFormers over the baseline transformers in a variety of practical applications including language modeling and image classification.

## 1 Introduction

Transformers [75] are powerful neural networks that have achieved tremendous success in many areas of machine learning [36, 68, 32] and have become the state-of-the-art model on a wide range of applications across different data modalities, from language [20, 1, 15, 10, 54, 4, 7, 18] to images [21, 39, 70, 55, 51, 24], videos [3, 40], point clouds [87, 27], and protein sequences [57, 30]. In addition to their excellent performance on supervised learning tasks, transformers can also effectively transfer the learned knowledge from a pretraining task to new tasks with limited or no supervision [52, 53, 20, 85, 38]. At the core of transformers is the dot-product self-attention, which mainly accounts for the success of transformer models [11, 48, 37]. This dot-product self-attention learns self-alignment between tokens in an input sequence by estimating the relative importance of each token with respect to all other tokens. It then transforms each token into a weighted average of the feature representations of the other tokens, where the weights are proportional to the importance scores between each pair of tokens. The importance scores in self-attention enable a token to attend to other tokens in the sequence, thus capturing the contextual representation [5, 75, 34].

### 1.1 Self-Attention

Given an input sequence $\mathbf{X} := [\mathbf{x}_1, \dots, \mathbf{x}_N]^\top \in \mathbb{R}^{N \times D_x}$ of $N$ feature vectors, self-attention computes the output sequence $\mathbf{H} := [\mathbf{h}_1, \dots, \mathbf{h}_N]^\top$ from $\mathbf{X}$ as follows:

Step 1: Projecting the input sequence into different subspaces. The input sequence $\mathbf{X}$ is transformed into the query matrix $\mathbf{Q}$, the key matrix $\mathbf{K}$, and the value matrix $\mathbf{V}$ via three linear transformations

$$\mathbf{Q} = \mathbf{X}\mathbf{W}_Q^\top; \quad \mathbf{K} = \mathbf{X}\mathbf{W}_K^\top; \quad \mathbf{V} = \mathbf{X}\mathbf{W}_V^\top,$$

where $\mathbf{W}_Q, \mathbf{W}_K \in \mathbb{R}^{D \times D_x}$ and $\mathbf{W}_V \in \mathbb{R}^{D_v \times D_x}$ are the weight matrices. We denote $\mathbf{Q} := [\mathbf{q}_1, \dots, \mathbf{q}_N]^\top$, $\mathbf{K} := [\mathbf{k}_1, \dots, \mathbf{k}_N]^\top$, and $\mathbf{V} := [\mathbf{v}_1, \dots, \mathbf{v}_N]^\top$, where the vectors $\mathbf{q}_i$, $\mathbf{k}_i$, and $\mathbf{v}_i$ for $i = 1, \dots, N$ are the query, key, and value vectors, respectively.

Step 2: Computing the output as a weighted average. The output sequence $\mathbf{H}$ is then given by

$$\mathbf{H} = \mathrm{softmax}\left(\mathbf{Q}\mathbf{K}^\top/\sqrt{D}\right)\mathbf{V} := \mathbf{A}\mathbf{V}, \qquad (1)$$

where the softmax function is applied to each row of the matrix $\mathbf{Q}\mathbf{K}^\top/\sqrt{D}$. For each query vector $\mathbf{q}_i$, $i = 1, \dots, N$, Eqn. (1) can be written in vector form to compute the output vector $\mathbf{h}_i$ as follows:

$$\mathbf{h}_i = \sum_{j=1}^N \mathrm{softmax}\left(\mathbf{q}_i^\top\mathbf{k}_j/\sqrt{D}\right)\mathbf{v}_j := \sum_{j=1}^N a_{ij}\mathbf{v}_j. \qquad (2)$$

The matrix $\mathbf{A}$ and its components $a_{ij}$ for $i, j = 1, \dots, N$ are called the attention matrix and attention scores, respectively. The self-attention computed by equations (1) and (2) is called the dot-product attention or softmax attention. In our paper, we refer to a transformer that uses this attention as the baseline transformer with the dot-product attention, or simply the dot-product transformer. The structure of the attention matrix $\mathbf{A}$ after training governs the ability of the self-attention to capture contextual representations for each token.
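As a concrete reference, the two steps above can be sketched in a few lines of NumPy; the dimensions and random weights below are arbitrary toy choices, not settings used in the paper:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(X, W_Q, W_K, W_V):
    """Single-head dot-product attention, following Eqns. (1)-(2)."""
    Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T   # Step 1: project into subspaces
    D = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(D))           # Step 2: row-wise softmax scores
    return A @ V, A
```

Each row of the attention matrix `A` is a probability distribution over the tokens, so its entries sum to one.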

Multi-head Attention: Each output sequence $\mathbf{H}$ forms an attention head. Multi-head attention concatenates multiple heads to compute the final output. Let $H$ be the number of heads and $\mathbf{W}^O$ be the projection matrix for the output. The multi-head attention is defined as

$$\mathrm{MultiHead}\left(\{\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i\}_{i=1}^H\right) = \mathrm{Concat}(\mathbf{H}_1, \dots, \mathbf{H}_H)\,\mathbf{W}^O.$$

The capacity of the attention mechanism and its ability to learn diverse syntactic and semantic relationships determine the success of transformers [69, 76, 14, 77, 28]. However, equations (1) and (2) imply that the dot-product attention assumes the features in $\mathbf{q}_i$, as well as the features in $\mathbf{k}_j$, are independent. Thus, the dot-product attention fails to capture the correlations between these features, limiting its representation capacity and inhibiting the performance of transformers on practical tasks, where there is no guarantee that independent features can be learned from complex data. One solution for capturing the correlations between the features in $\mathbf{q}_i$ and $\mathbf{k}_j$ is to introduce covariance matrices into the formulation of the dot-product attention, at the cost of significantly increasing the computational complexity. Moreover, choosing good covariance matrices is difficult.

### 1.2 Contribution

In this paper, we first establish a correspondence between self-attention and nonparametric kernel regression. Under this new perspective of self-attention, we explain the limitation of the dot-product self-attention that it may fail to capture correlations between the features in the query and key vectors. We then leverage the generalized Fourier integral theorems, which can automatically capture these correlations, and derive the generalized Fourier integral estimators for the nonparametric regression problem. Using this new density estimator, we propose the FourierFormer, a novel class of transformers that can capture correlations between features in the query and key vectors of self-attention. In summary, our contribution is three-fold:

1. We derive the formula of self-attention from solving a nonparametric kernel regression problem, thus providing a nonparametric regression interpretation to study and further develop self-attention.

2. We develop the generalized Fourier integral estimators for the nonparametric regression problem and provide theoretical guarantees for these estimators.

3. We propose the FourierFormer, whose attentions use the generalized Fourier integral estimators to more efficiently capture correlations between features in the query and key vectors.

Finally, we empirically show that the FourierFormer attains significantly better accuracy than the baseline transformer with the dot-product attention on a variety of tasks, including WikiText-103 language modeling and ImageNet image classification. We also demonstrate in our experiments that FourierFormer helps reduce the redundancy between attention heads.

Organization We structure this paper as follows: In Section 2, we present the correspondence between self-attention and nonparametric kernel regression. In Section 3, we discuss the generalized Fourier integral estimators and define the FourierFormer. We validate and empirically analyze the advantages of FourierFormer in Section 4. We discuss related works in Section 5. The paper ends with concluding remarks. Technical proofs and more experimental details are provided in the Appendix.

Notation For any $N \in \mathbb{N}$, we denote $[N] := \{1, 2, \dots, N\}$. For any $D \geq 1$, $L^1(\mathbb{R}^D)$ denotes the space of real-valued functions on $\mathbb{R}^D$ that are integrable. For any two sequences $\{a_N\}$ and $\{b_N\}$, we denote $a_N \lesssim b_N$ to mean that $a_N \leq C b_N$ for all $N$, where $C$ is some universal constant.

## 2 A Nonparametric Regression Interpretation of Self-attention

In this section, we establish the connection between self-attention and nonparametric kernel regression. In particular, we derive the self-attention in equation (2) as a nonparametric kernel regression in which the key vectors $\mathbf{k}_j$ and value vectors $\mathbf{v}_j$ are training inputs and training targets, respectively, while the query vectors $\mathbf{q}_i$ and the output vectors $\mathbf{h}_i$ form a set of new inputs and their corresponding targets that need to be estimated, respectively, for $i, j \in [N]$. In general, we can view the training set $\{\mathbf{k}_j, \mathbf{v}_j\}_{j \in [N]}$ as coming from the following nonparametric regression model:

$$\mathbf{v}_j = f(\mathbf{k}_j) + \varepsilon_j, \qquad (3)$$

where $\varepsilon_1, \dots, \varepsilon_N$ are independent noises such that $\mathbb{E}[\varepsilon_j] = 0$. Furthermore, we consider a random design setting where the key vectors $\mathbf{k}_1, \dots, \mathbf{k}_N$ are i.i.d. samples from a distribution that admits $p(\mathbf{k})$ as its density function. By an abuse of notation, we also denote by $p(\mathbf{v}, \mathbf{k})$ the joint density from which the key and value vectors $\{\mathbf{k}_j, \mathbf{v}_j\}_{j \in [N]}$ are i.i.d. sampled. Here, $f$ is a true but unknown function that we would like to estimate.

Nadaraya–Watson estimator: Our approach to estimating the function $f$ is based on Nadaraya–Watson's nonparametric kernel regression approach [46]. In particular, from the nonparametric regression model (3), we have $f(\mathbf{k}) = \mathbb{E}[\mathbf{v} \mid \mathbf{k}]$ for all $\mathbf{k}$. Therefore, it is sufficient to estimate the conditional distribution of the value vectors given the key vectors. Given the density function $p(\mathbf{k})$ of the key vectors and the joint density $p(\mathbf{v}, \mathbf{k})$ of the key and value vectors, for any pair of vectors $(\mathbf{v}, \mathbf{k})$ generated from model (3) we have

$$\mathbb{E}[\mathbf{v} \mid \mathbf{k}] = \int_{\mathbb{R}^{D}} \mathbf{v} \cdot p(\mathbf{v} \mid \mathbf{k})\, d\mathbf{v} = \int_{\mathbb{R}^{D}} \mathbf{v} \cdot \frac{p(\mathbf{v}, \mathbf{k})}{p(\mathbf{k})}\, d\mathbf{v}. \qquad (4)$$

The formulation (4) of the conditional expectation indicates that as long as we can estimate the joint density function $p(\mathbf{v}, \mathbf{k})$ and the marginal density function $p(\mathbf{k})$, we can obtain an estimate of the conditional expectation and thus of the function $f$. This approach is widely known as Nadaraya–Watson's nonparametric kernel regression approach.

Kernel density estimator: To estimate $p(\mathbf{v}, \mathbf{k})$ and $p(\mathbf{k})$, we employ the kernel density estimation approach [58, 49]. In particular, by using the isotropic Gaussian kernel with bandwidth $\sigma$, we have the following estimators of $p(\mathbf{v}, \mathbf{k})$ and $p(\mathbf{k})$:

$$\hat{p}_\sigma(\mathbf{v}, \mathbf{k}) = \frac{1}{N}\sum_{j=1}^N \varphi_\sigma(\mathbf{v} - \mathbf{v}_j)\,\varphi_\sigma(\mathbf{k} - \mathbf{k}_j), \qquad \hat{p}_\sigma(\mathbf{k}) = \frac{1}{N}\sum_{j=1}^N \varphi_\sigma(\mathbf{k} - \mathbf{k}_j), \qquad (5)$$

where $\varphi_\sigma$ is the isotropic multivariate Gaussian density function with diagonal covariance matrix $\sigma^2\mathbf{I}$. Given the kernel density estimators in (5), we obtain the following estimate of the function $f$:

$$\hat{f}_\sigma(\mathbf{k}) = \int_{\mathbb{R}^{D}} \mathbf{v} \cdot \frac{\hat{p}_\sigma(\mathbf{v}, \mathbf{k})}{\hat{p}_\sigma(\mathbf{k})}\, d\mathbf{v} = \frac{\sum_{j=1}^N \varphi_\sigma(\mathbf{k} - \mathbf{k}_j)\int \mathbf{v} \cdot \varphi_\sigma(\mathbf{v} - \mathbf{v}_j)\, d\mathbf{v}}{\sum_{j=1}^N \varphi_\sigma(\mathbf{k} - \mathbf{k}_j)} = \frac{\sum_{j=1}^N \mathbf{v}_j\, \varphi_\sigma(\mathbf{k} - \mathbf{k}_j)}{\sum_{j=1}^N \varphi_\sigma(\mathbf{k} - \mathbf{k}_j)}. \qquad (6)$$

Connection between self-attention and nonparametric regression: By plugging the query vectors $\mathbf{q}_i$ into the function $\hat{f}_\sigma$ in equation (6), we obtain

$$\hat{f}_\sigma(\mathbf{q}_i) = \frac{\sum_{j=1}^N \mathbf{v}_j \exp\left(-\|\mathbf{q}_i - \mathbf{k}_j\|^2/2\sigma^2\right)}{\sum_{j'=1}^N \exp\left(-\|\mathbf{q}_i - \mathbf{k}_{j'}\|^2/2\sigma^2\right)} = \frac{\sum_{j=1}^N \mathbf{v}_j \exp\left[-(\|\mathbf{q}_i\|^2 + \|\mathbf{k}_j\|^2)/2\sigma^2\right]\exp\left(\mathbf{q}_i^\top\mathbf{k}_j/\sigma^2\right)}{\sum_{j'=1}^N \exp\left[-(\|\mathbf{q}_i\|^2 + \|\mathbf{k}_{j'}\|^2)/2\sigma^2\right]\exp\left(\mathbf{q}_i^\top\mathbf{k}_{j'}/\sigma^2\right)}. \qquad (7)$$

If we further assume that the key vectors $\mathbf{k}_j$ are normalized, which is usually done in practice to stabilize the training of transformers [63], the terms $\|\mathbf{k}_j\|^2$ in equation (7) become constant and cancel, and the value of $\hat{f}_\sigma(\mathbf{q}_i)$ becomes

$$\hat{f}_\sigma(\mathbf{q}_i) = \frac{\sum_{j=1}^N \mathbf{v}_j \exp\left(\mathbf{q}_i^\top\mathbf{k}_j/\sigma^2\right)}{\sum_{j'=1}^N \exp\left(\mathbf{q}_i^\top\mathbf{k}_{j'}/\sigma^2\right)} = \sum_{j=1}^N \mathrm{softmax}\left(\mathbf{q}_i^\top\mathbf{k}_j/\sigma^2\right)\mathbf{v}_j. \qquad (8)$$

When we choose $\sigma^2 = \sqrt{D}$, where $D$ is the dimension of $\mathbf{q}_i$ and $\mathbf{k}_j$, equation (8) matches equation (2) of self-attention, namely $\hat{f}_\sigma(\mathbf{q}_i) = \mathbf{h}_i$. Thus, we have shown that self-attention performs nonparametric regression using isotropic Gaussian kernels.
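The equivalence above is easy to check numerically. The sketch below (with toy dimensions and random data) implements the Gaussian Nadaraya–Watson estimator of equation (6) and the softmax form of equation (8), and the two coincide for normalized keys because the constant $\|\mathbf{k}_j\|^2$ factors cancel in the ratio:

```python
import numpy as np

def gaussian_nw(q, K, V, sigma2):
    """Nadaraya-Watson estimate with isotropic Gaussian kernels, Eqn. (6)."""
    w = np.exp(-np.sum((q - K) ** 2, axis=-1) / (2.0 * sigma2))
    return (w[:, None] * V).sum(axis=0) / w.sum()

def softmax_attention_row(q, K, V, sigma2):
    """One output row of softmax attention, Eqn. (8)."""
    s = K @ q / sigma2
    a = np.exp(s - s.max())
    a /= a.sum()
    return a @ V
```

With keys rescaled to unit norm and $\sigma^2 = \sqrt{D}$, the two functions return the same vector up to floating-point error.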

###### Remark 1.

The assumption that the key vectors $\mathbf{k}_j$ are normalized serves only to recover the pairwise dot-product attention in transformers; in general, it is not necessary. In fact, the isotropic Gaussian kernel in equation (7) is more desirable than the dot-product kernel in equation (8) of the pairwise dot-product attention, since the former is Lipschitz while the latter is not [33]. The Lipschitz constraint helps improve the robustness of the model [13, 73, 2] and stabilize model training [44].

Limitation of Self-Attention: From our nonparametric regression interpretation, self-attention is derived from the use of isotropic Gaussian kernels for kernel density estimation and nonparametric regression estimation, which may fail to capture the complex correlations between the features in $\mathbf{q}_i$ and $\mathbf{k}_j$ [80, 29]. Using multivariate Gaussian kernels with dense covariance matrices can help capture such correlations; however, choosing good covariance matrices is challenging and inefficient [79, 65, 9]. In the following section, we discuss the Fourier integral estimator and its use as a kernel for computing self-attention in order to overcome these limitations.

## 3 FourierFormer: Transformer via Generalized Fourier Integral Theorem

In the following, we introduce the generalized Fourier integral theorems, which are able to capture the complex interactions among the features of the queries and keys. We then apply these theorems to the density estimation and nonparametric regression problems. We also establish the convergence rates of these estimators. Given these density estimators, we introduce a novel family of transformers, named FourierFormer, that integrates the generalized Fourier integral theorem into the dot-product attention step of the standard transformer.

### 3.1 (Generalized) Fourier Integral Theorems and Their Applications

The Fourier integral theorem is a beautiful result in mathematics [84, 6] that has recently been used in nonparametric mode clustering, the deconvolution problem, and generative modeling [29]. It is a combination of the Fourier transform and its inverse. In particular, for any function $p \in L^1(\mathbb{R}^D)$, the Fourier integral theorem states that

$$p(\mathbf{k}) = \frac{1}{(2\pi)^D}\int_{\mathbb{R}^D}\int_{\mathbb{R}^D}\cos\left(\mathbf{s}^\top(\mathbf{k} - \mathbf{y})\right)p(\mathbf{y})\, d\mathbf{y}\, d\mathbf{s} = \frac{1}{\pi^D}\lim_{R \to \infty}\int_{\mathbb{R}^D}\prod_{j=1}^D\frac{\sin\left(R(k_j - y_j)\right)}{k_j - y_j}\, p(\mathbf{y})\, d\mathbf{y}, \qquad (9)$$

where $\mathbf{k} = (k_1, \dots, k_D)$ and $\mathbf{y} = (y_1, \dots, y_D)$. Equation (9) suggests that

$$p_R(\mathbf{k}) := \frac{1}{\pi^D}\int_{\mathbb{R}^D}\prod_{j=1}^D\frac{\sin\left(R(k_j - y_j)\right)}{k_j - y_j}\, p(\mathbf{y})\, d\mathbf{y}$$

can be used as an estimator of the function $p$.

Benefits of the Fourier integral over the Gaussian kernel: There are two important benefits of the estimator $p_R$: (i) it can automatically preserve the correlated structure within $p$ even when $p$ is a very complex and high-dimensional function. This is in stark contrast to the standard kernel estimator built on the multivariate Gaussian kernel, where we need to choose a good covariance matrix for such an estimator to work well. We note that since the standard softmax transformer is constructed from the multivariate Gaussian kernel, the issue of choosing a good covariance matrix in the dot-product transformer is inevitable; (ii) the product of sinc kernels in the estimator $p_R$ does not decay to a point mass when $R \to \infty$. This is in stark contrast to the multivariate Gaussian kernel estimator, which converges to a point mass when the covariance matrix goes to 0. It indicates that $p_R$ is a non-trivial estimator of the function $p$. Detailed illustrations of these benefits of the Fourier integral over the Gaussian kernel in density estimation and nonparametric regression problems, which we have just shown to be connected to self-attention in transformers, can be found in Section 8 of [29].
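For intuition, the one-dimensional version of $p_R$ can be evaluated numerically. The sketch below is illustrative only: the quadrature grid, the choice of a standard normal target density, and the value of $R$ are our own assumptions. For a Gaussian (whose Fourier transform decays very fast), $p_R$ is already close to $p$ at moderate $R$:

```python
import numpy as np

def trapezoid(vals, grid):
    """Composite trapezoid rule on a uniform grid."""
    h = grid[1] - grid[0]
    return h * (vals.sum() - 0.5 * (vals[0] + vals[-1]))

def fourier_integral_estimate(k, R, p, grid):
    """1-D estimator p_R(k) = (1/pi) * integral sin(R(k-y))/(k-y) * p(y) dy."""
    z = k - grid
    # sin(Rz)/z = R * sinc(Rz/pi); np.sinc(x) = sin(pi x)/(pi x) handles z -> 0
    kern = R * np.sinc(R * z / np.pi)
    return trapezoid(kern * p(grid), grid) / np.pi
```

The grid must be fine enough to resolve the oscillation of the sinc kernel (period $\pi/R$), which is why the test below uses a dense grid.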

Generalized Fourier integral estimator: Given the above benefits of the Fourier integral estimator $p_R$, in this paper we consider a generalization of that estimator, named the generalized Fourier integral estimator, which is given by:

$$p_R^\phi(\mathbf{k}) := \frac{R^D}{C^D}\int_{\mathbb{R}^D}\prod_{j=1}^D\phi\left(\frac{\sin\left(R(k_j - y_j)\right)}{R(k_j - y_j)}\right)p(\mathbf{y})\, d\mathbf{y},$$

where $\mathbf{y} = (y_1, \dots, y_D)$, $\phi: \mathbb{R} \to \mathbb{R}$ is a given function, and $C := \int_{\mathbb{R}}\phi(\sin(z)/z)\, dz$ is a normalization constant. When $\phi(x) = x$ for all $x$, the generalized Fourier integral estimator $p_R^\phi$ becomes the Fourier integral estimator $p_R$ (in this case $C = \pi$). Under appropriate conditions on the function $\phi$ (see Theorem 1 in Section 3.1.1 and Theorem 3 in Appendix A.1), the estimator $p_R^\phi$ converges to the true function $p$, namely,

$$\lim_{R \to \infty} p_R^\phi(\mathbf{k}) = p(\mathbf{k}).$$

We name the above limit the generalized Fourier integral theorem. Furthermore, the estimator $p_R^\phi$ also inherits the aforementioned benefits of the Fourier integral estimator $p_R$. Therefore, we will use the generalized Fourier integral theorem as a building block for constructing density estimators and nonparametric regression estimators, which are crucial for developing the FourierFormer in Section 3.2.

#### 3.1.1 Density Estimation via Generalized Fourier Integral Theorems

We first apply the generalized Fourier integral theorem to the density estimation problem. To ease the presentation, we assume that $\mathbf{k}_1, \dots, \mathbf{k}_N$ are i.i.d. samples from a distribution admitting density function $p$ on $\mathbb{R}^D$, where $D$ is the dimension. Inspired by the generalized Fourier integral theorem, we obtain the following generalized Fourier density estimator of $p$:

$$p_{N,R}^\phi(\mathbf{k}) := \frac{R^D}{NC^D}\sum_{i=1}^N\prod_{j=1}^D\phi\left(\frac{\sin\left(R(k_j - k_{ij})\right)}{R(k_j - k_{ij})}\right), \qquad (12)$$

where $\mathbf{k} = (k_1, \dots, k_D)$, $\mathbf{k}_i = (k_{i1}, \dots, k_{iD})$ for all $i \in [N]$, and $C := \int_{\mathbb{R}}\phi(\sin(z)/z)\, dz$. To quantify the error between the generalized Fourier density estimator $p_{N,R}^\phi$ and the true density $p$, we utilize the mean integrated squared error (MISE) [83], which is given by:

$$\mathrm{MISE}\left(p_{N,R}^\phi, p\right) := \mathbb{E}\left[\int_{\mathbb{R}^D}\left(p_{N,R}^\phi(\mathbf{k}) - p(\mathbf{k})\right)^2 d\mathbf{k}\right], \qquad (13)$$

where the expectation is taken with respect to the samples $\mathbf{k}_1, \dots, \mathbf{k}_N$. We start with the following bound on the MISE between $p_{N,R}^\phi$ and $p$.

###### Theorem 1.

Assume that the function $\phi$ and the true density $p$ satisfy regularity conditions under which the bias of the estimator is of order $R^{-(m+1)}$ for some $m \in \mathbb{N}$ (the precise assumptions on $\phi$ and $p$ are stated in Appendix B.1). Then, there exist universal constants $C$ and $C'$ depending on $\phi$ and $p$ such that

$$\mathrm{MISE}\left(p_{N,R}^\phi, p\right) \leq \frac{C}{R^{m+1}} + \frac{C'R^D}{N}.$$

Proof of Theorem 1 is in Appendix B.1. A few comments are in order. First, by choosing the radius $R$ to balance the bias and variance in the bound of the MISE in Theorem 1, we obtain the optimal $R \asymp N^{1/(m+1+D)}$. With that choice of $R$, the MISE rate of $p_{N,R}^\phi$ is $\mathcal{O}(N^{-(m+1)/(m+1+D)})$. Second, when $\phi$ is a power function $\phi(x) = x^n$ with a suitable $n \geq 2$, the assumptions in Theorem 1 are satisfied for an appropriate value of $m$. However, these assumptions are not satisfied when $\phi(x) = x$, i.e., for the original Fourier integral estimator, due to a limitation of the current proof technique of Theorem 1, which is based on a Taylor expansion of the estimator $p_{N,R}^\phi$.
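A minimal one-dimensional sketch of the generalized Fourier density estimator (12) with the nonnegative choice $\phi(x) = x^2$ can make these objects concrete. The normalization $C = \int \sin^2(z)/z^2\, dz = \pi$ follows from that choice; the sample size, target density, and $R$ below are illustrative assumptions of ours:

```python
import numpy as np

def phi_sq(x):
    """phi(x) = x^2: yields a nonnegative (Fejer-type) kernel."""
    return x * x

def generalized_fourier_density(k_grid, samples, R, phi=phi_sq, C=np.pi):
    """p_{N,R}(k) = R/(N*C) * sum_i phi( sin(R(k - k_i)) / (R(k - k_i)) ), 1-D."""
    z = k_grid[:, None] - samples[None, :]
    sinc = np.sinc(R * z / np.pi)        # sin(Rz)/(Rz), with the z = 0 limit = 1
    return R / (len(samples) * C) * phi(sinc).sum(axis=1)
```

Because the kernel is a square, the estimate is pointwise nonnegative, and each kernel integrates to one, so the estimate has total mass close to one up to truncation of the grid.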

To address the limitation of the Taylor expansion technique, we utilize the Plancherel theorem in Fourier analysis to establish the MISE rate of $p_{N,R}^\phi$ when $\phi(x) = x$. The details of the theoretical analysis for this setting are in Appendix A.

### 3.2 FourierFormer: Transformers with Fourier Attentions

Motivated by the preservation of the correlated structure of the function $p$ under the generalized Fourier integral theorem, as well as the theoretical guarantees of its density estimators, in this section we adapt the nonparametric regression interpretation of self-attention in Section 2 and propose the generalized Fourier nonparametric regression estimator in Section 3.2.1. We also establish the convergence properties of that estimator. Then, based on the generalized Fourier nonparametric regression estimator, we develop the Fourier attention and its corresponding FourierFormer in Section 3.2.2.

#### 3.2.1 Nonparametric Regression via Generalized Fourier Integral Theorem

We now discuss an application of the generalized Fourier integral theorem to the nonparametric regression setting (3); namely, we assume that $\{(\mathbf{k}_j, \mathbf{v}_j)\}_{j \in [N]}$ are i.i.d. samples from the following nonparametric regression model:

 vj=f(kj)+εj,

where $\varepsilon_1, \dots, \varepsilon_N$ are independent noises such that $\mathbb{E}[\varepsilon_j] = 0$, and the key vectors $\mathbf{k}_1, \dots, \mathbf{k}_N$ are i.i.d. samples from the density $p$. Given the generalized Fourier density estimator (12), and following the argument in Section 2, the Nadaraya–Watson estimator of the function $f$ based on the generalized Fourier density estimator is given by:

$$f_{N,R}(\mathbf{k}) := \frac{\sum_{i=1}^N \mathbf{v}_i \prod_{j=1}^D \phi\left(\frac{\sin\left(R(k_j - k_{ij})\right)}{R(k_j - k_{ij})}\right)}{\sum_{i=1}^N \prod_{j=1}^D \phi\left(\frac{\sin\left(R(k_j - k_{ij})\right)}{R(k_j - k_{ij})}\right)}. \qquad (14)$$

The main difference between the generalized Fourier nonparametric regression estimator $f_{N,R}$ in equation (14) and the estimator $\hat{f}_\sigma$ in equation (6) is that $f_{N,R}$ utilizes the generalized Fourier density estimator to estimate the conditional distribution of the value vectors given the key vectors, instead of the isotropic Gaussian kernel density estimator as in $\hat{f}_\sigma$. As we highlighted in Section 3.1, an important benefit of the generalized Fourier density estimator is that it can capture the complex dependencies between the features of the value vectors and the key vectors, while the Gaussian kernel needs a good covariance matrix to do so, which is computationally expensive to obtain in practice.
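The estimator in equation (14) can be sketched in one dimension as follows. The choice $\phi(x) = x^2$ and all data below are illustrative; note that any normalization constant of the density estimator cancels in the Nadaraya–Watson ratio, so it can be dropped:

```python
import numpy as np

def fourier_nw(k, keys, values, R, phi=lambda x: x * x):
    """Generalized Fourier Nadaraya-Watson estimate, Eqn. (14), for 1-D keys."""
    z = k - keys
    w = phi(np.sinc(R * z / np.pi))   # phi(sin(Rz)/(Rz)); normalization cancels
    return (w * values).sum() / w.sum()
```

A simple sanity check: a constant regression function is reproduced exactly, since the weights cancel in the ratio, and noisy samples of a smooth function are smoothed toward the true value.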

We now have the following result establishing the mean squared error (MSE) of $f_{N,R}$.

###### Theorem 2.

Assume that the function $\phi$ and the regression function $f$ satisfy regularity conditions analogous to those of Theorem 1, for some $m \in \mathbb{N}$ (the precise assumptions are stated in Appendix B.3). Then, for any $\mathbf{k} \in \mathbb{R}^D$, there exist universal constants $C_1$ and $C_2$ such that the following holds:

$$\mathbb{E}\left[\left(f_{N,R}(\mathbf{k}) - f(\mathbf{k})\right)^2\right] \leq \frac{C_1}{R^{m+1}} + \frac{C_2R^D}{N},$$

where $C_1$ and $C_2$ depend on $\phi$, $p$, and $f$. Here, the outer expectation is taken with respect to the key vectors $\{\mathbf{k}_j\}_{j \in [N]}$ and the noises $\{\varepsilon_j\}_{j \in [N]}$.

Proof of Theorem 2 is in Appendix B.3. A few comments on Theorem 2 are in order. First, by choosing the radius $R$ to balance the bias and variance in the bound of the MSE of the nonparametric generalized Fourier estimator $f_{N,R}$, we obtain the optimal radius $R \asymp N^{1/(m+1+D)}$. With that choice of the optimal radius, the rate of $f_{N,R}$ is $\mathcal{O}(N^{-(m+1)/(m+1+D)})$. Second, when $\phi$ is a power function $\phi(x) = x^n$ with a suitable $n$, the assumption on the function $\phi$ in Theorem 2 is satisfied for an appropriate $m$. In Appendix A, we also provide the rate of $f_{N,R}$ when $\phi(x) = x^n$ for some $n \in \mathbb{N}$, which includes the original Fourier integral theorem ($n = 1$).

#### 3.2.2 FourierFormer

Given the generalized Fourier nonparametric regression estimator $f_{N,R}$ in equation (14), by plugging the query vectors $\mathbf{q}_i$ into that function, we obtain the following definition of the Fourier attention:

###### Definition 1 (Fourier Attention).

A Fourier attention is a multi-head attention that performs nonparametric regression using the generalized Fourier nonparametric regression estimator $f_{N,R}$. The output of the Fourier attention is computed as

$$\hat{\mathbf{h}}_i := f_{N,R}(\mathbf{q}_i) = \frac{\sum_{j=1}^N \mathbf{v}_j \prod_{d=1}^D \phi\left(\frac{\sin\left(R(q_{id} - k_{jd})\right)}{R(q_{id} - k_{jd})}\right)}{\sum_{j=1}^N \prod_{d=1}^D \phi\left(\frac{\sin\left(R(q_{id} - k_{jd})\right)}{R(q_{id} - k_{jd})}\right)} \quad \forall\, i \in [N]. \qquad (15)$$

Given the Fourier attention in Definition 1, we define the FourierFormer as follows.

###### Definition 2 (FourierFormer).

A FourierFormer is a transformer that uses Fourier attention to capture dependency between tokens in the input sequence and the correlation between features in each token.

###### Remark 2 (The Nonnegativity of the Fourier Kernel).

The density estimation via the generalized Fourier integral theorem in Section 3.1.1 does not require the generalized Fourier density estimator to be nonnegative. However, empirically, we observe that negative density estimates can cause instability in training the FourierFormer. Thus, in FourierFormer, we choose the function $\phi$ so that the density estimator is guaranteed to be nonnegative. In particular, we choose $\phi$ to be power functions of the form $\phi(x) = x^n$, where $n$ is a positive even integer. Note that when $n = 2$ and $n = 4$, the kernels in our generalized Fourier integral estimators are the well-known Fejér–de la Vallée Poussin and Jackson–de la Vallée Poussin kernels [17].
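A dense, naive NumPy sketch of the Fourier attention of equation (15) with $\phi(x) = x^n$ is given below. This is an illustrative reference only, not the paper's optimized CUDA implementation; here $R$ is a fixed scalar hyperparameter, whereas it can be made learnable in practice, and the toy dimensions are our own choices:

```python
import numpy as np

def fourier_attention(Q, K, V, R, n=2):
    """Fourier attention head, Eqn. (15), with phi(x) = x^n (n even for nonnegativity)."""
    Z = Q[:, None, :] - K[None, :, :]        # (N, N, D) pairwise query-key differences
    W = np.sinc(R * Z / np.pi) ** n          # phi(sin(Rz)/(Rz)) per coordinate
    A = W.prod(axis=-1)                      # product over the D feature dimensions
    A = A / A.sum(axis=-1, keepdims=True)    # normalize over keys, as in Eqn. (15)
    return A @ V
```

Unlike softmax attention, the attention weights here come from a product over the feature coordinates, which is what couples the features of the queries and keys.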

### 3.3 An Efficient Implementation of the Fourier Attention

The Fourier kernel is implemented efficiently as a C++/CUDA extension for PyTorch [50]. The idea is similar to that of the function cdist [50], which computes the p-norm distance between each pair of row vectors from two collections. In our case, we aim to compute the kernel functions that define the Fourier attention in Definition 1. The core of this implementation is the following Fourier metric function $d_f$:

$$d_f(\mathbf{q}_i, \mathbf{k}_j) = \prod_{d=1}^D \phi\left(\frac{\sin\left(R(q_{id} - k_{jd})\right)}{R(q_{id} - k_{jd})}\right).$$

We directly implement $d_f$ as a torch.autograd.Function [50], in which we provide an efficient way to compute the forward and backward functions ($d_f$ and the gradient of $d_f$). While the implementation of the forward function is straightforward, the backward function is trickier, since we need to optimize the code to compute the gradients of $d_f$ with respect to the variables $\mathbf{q}$, $\mathbf{k}$, and $R$ all at once. We develop the backward function with highly parallel computation by exploiting the GPU architecture and utilizing the reduction technique. The computational time is comparable to that of cdist; thus, our FourierFormer implementation is similarly time-efficient.
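A pure-NumPy reference for the forward pass of $d_f$, computed pairwise in the style of cdist via broadcasting, can be useful for testing an optimized kernel against. This is an illustrative sketch (with $\phi(x) = x^n$), not the paper's CUDA code:

```python
import numpy as np

def fourier_metric(Qm, Km, R, n=2):
    """Pairwise d_f(q_i, k_j) = prod_d phi(sin(R(q_id - k_jd)) / (R(q_id - k_jd)))."""
    Z = Qm[:, None, :] - Km[None, :, :]              # (N_q, N_k, D)
    return (np.sinc(R * Z / np.pi) ** n).prod(axis=-1)   # reduce over features
```

The broadcast version agrees with an explicit double loop over query/key pairs, which is the property an optimized kernel should preserve.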

## 4 Experimental Results

In this section, we numerically justify the advantage of FourierFormer over the baseline dot-product transformer on two large-scale tasks: language modeling on WikiText-103 [42] (Section 4.1) and image classification on ImageNet [19, 59] (Section 4.2). We aim to show that: (i) FourierFormer achieves better accuracy than the baseline transformer on a variety of practical tasks with different data modalities, and (ii) FourierFormer helps reduce head redundancy compared to the baseline transformer (Section 4.3).

Throughout the section, we compare FourierFormers with baseline dot-product transformers of the same configuration. In all experiments, we make the constant $R$ in the Fourier attention (see equation (15)) a learnable scalar and choose the function $\phi$ to be a power function as described in Remark 2. All of our results are averaged over 5 runs with different seeds. More details on the models and training are provided in Appendix C. We also provide additional experimental results in Appendix D.

### 4.1 Language Modeling on WikiText-103

Datasets and metrics WikiText-103 is a collection of articles from Wikipedia with long contextual dependencies. The training set consists of about 28K articles containing 103M running words; this corresponds to text blocks of about 3600 words. The validation and test sets have about 218K and 246K running words, respectively, each comprising 60 articles. Our experiment follows the standard setting [42, 63] and splits the training data into independent long segments of $L$ words. For evaluation, we use a batch size of 1 and process the text sequence with a sliding window of size $L$. The last position is used for computing perplexity (PPL), except in the first segment, where all positions are evaluated, as in [1, 63].

Models and baselines: Our implementation is based on the public code by [63] (implementation available at https://github.com/IDSIA/lmtool-fwp). We use their small and medium models in our experiments. In particular, for small models, the key, value, and query dimensions are set to 128, and the training and evaluation context lengths are set to 256. For medium models, the key, value, and query dimensions are set to 256, and the training and evaluation context lengths are set to 384. In both configurations, the number of heads is 8, the feed-forward layer dimension is 2048, and the number of layers is 16.

Results: We report the validation and test perplexity (PPL) of FourierFormer versus the baseline transformer with the dot-product attention in Table 1. FourierFormers attain much better PPL than the baselines in both the small and medium configurations. For the small configuration, the improvements of FourierFormer over the baseline are 1.29 PPL in validation and 1.44 PPL in test. For the medium configuration, these improvements are 1.39 PPL in validation and 1.59 PPL in test. These results suggest that the advantage of FourierFormer over the baseline dot-product transformer grows with the model's size. This meets our expectation, because larger models have larger query and key dimensions; e.g., the medium-configuration language model in this experiment has query and key dimension 256, versus 128 for the small configuration. Since the advantage of FourierFormer results from its ability to capture correlations between features in the query and key vectors, the larger the query and key dimensions are, the more advantage FourierFormer has.

### 4.2 Image Classification on ImageNet

Datasets and metrics The ImageNet dataset [19, 59] consists of 1.28M training images and 50K validation images. For this benchmark, the model learns to predict the category of the input image among 1000 categories. Top-1 and top-5 classification accuracies are reported.

Models and baselines: We use the DeiT-tiny model [71] with 12 transformer layers, 4 attention heads per layer, and a model dimension of 192. To train the models, we follow the same setting and configuration as for the baseline [71] (implementation available at https://github.com/facebookresearch/deit).

Results: We summarize our results in Table 2. As in the language modeling experiment, for this image classification task, the DeiT model equipped with FourierFormer significantly outperforms the baseline DeiT dot-product transformer in both top-1 and top-5 accuracy. This result suggests that the advantage of FourierFormer over the baseline dot-product transformer holds across different data modalities.

### 4.3 FourierFormer Helps Reduce Head Redundancy

To study the diversity of the attention heads, given the model trained for the WikiText-103 language modeling task, we compute the average distance between the heads in each layer. We show the layer-averaged mean and variance of the distances between heads in Table 3. The results in Table 3 show that FourierFormer obtains a greater distance between attention heads than the baseline transformer with the dot-product attention, and thus helps reduce head redundancy. Note that we use the small configuration as specified in Section 4.1 for both models.

## 5 Related Work

Interpretation of Attention Mechanism in Transformers: Recent works have tried to understand transformer attention from different perspectives. [72] considers attention as applying a kernel smoother over the inputs. Extending this kernel approach, [31, 12, 81] linearize the softmax kernel in the dot-product attention and propose a family of efficient transformers with linear computational and memory complexity. [8] then shows that these linear transformers are comparable to a Petrov–Galerkin projection [56], suggesting that the softmax normalization in the dot-product attention is sufficient but not necessary. Other works that understand attention in transformers via ordinary/partial differential equations include [41, 61]. In addition, [67, 26, 86, 47] relate attention in transformers to Gaussian mixture models. Several works also connect the attention mechanism to graph-structured learning and message passing in graphical models [82, 64, 35]. Our work focuses on deriving the connection between self-attention and nonparametric kernel regression and on exploring better regression estimators, such as the generalized Fourier nonparametric regression estimator, to improve the performance of transformers.

Redundancy in Transformers: [16, 43, 22] show that neurons and attention heads in pre-trained transformers are redundant and can be removed when the model is applied to a downstream task. By studying the contextualized embeddings in pre-trained networks, it has been demonstrated that the learned representations from these redundant models are highly anisotropic [45, 23]. Furthermore, [62, 66, 78, 60] employ knowledge distillation and sparse approximation to enhance the efficiency of transformers. Our FourierFormer is complementary to these methods and can be combined with them.

## 6 Concluding Remarks

In this paper, we establish the correspondence between nonparametric kernel regression and self-attention in transformers. We then develop the generalized Fourier integral estimators and propose the FourierFormer, a novel class of transformers that use the generalized Fourier integral estimators to construct their attentions, efficiently capturing the correlations between features in the query and key vectors. We theoretically prove the approximation guarantees of the generalized Fourier integral estimators and empirically validate the advantage of FourierFormer over the baseline transformer with the dot-product attention in terms of accuracy and head redundancy reduction. It would be interesting to incorporate robust kernels into the nonparametric regression framework of FourierFormer to enhance the robustness of the model under data perturbation and adversarial attacks. A limitation of FourierFormer is that it still has the same quadratic computational and memory complexity as the baseline transformer with the dot-product attention. We leave the development of a linear version of FourierFormer that achieves linear computational and memory complexity as future work. It is worth noting that we foresee no potential negative societal impact of FourierFormer.

Supplement to “FourierFormer: Transformer Meets Generalized Fourier Integral Theorem”

In the supplementary material, we collect proofs, additional theory, and experimental results deferred from the main text. In Appendix A, we provide additional theoretical results for the generalized Fourier density estimator and the generalized Fourier nonparametric regression estimator. We provide proofs of key results in the main text and additional theory in Appendix B. We present experimental details in Appendix C and include additional experimental results in Appendix D.

## Appendix A Additional Theoretical Results

In this section, we provide additional theoretical results for the generalized Fourier density estimator in Appendix A.1 and for the generalized Fourier nonparametric regression estimator in Appendix A.2.

### A.1 Generalized Fourier density estimator

We now establish the MISE rate of the generalized Fourier density estimator $p_{\phi_N, R}$ in equation (12), both when $\phi$ is the sine function of the classical Fourier integral theorem and when it is not. We consider the following tail bounds on the Fourier transform of the true density function $p$.

###### Definition 3.

(1) We say that the function $p$ is supersmooth of order $\alpha > 0$ if there exist universal constants $C_1$ and $C_2$ such that the following inequality holds for almost every $x \in \mathbb{R}^D$:

$$|\hat{p}(x)| \leq C_1 \exp\Big(-C_2 \sum_{j=1}^{D} |x_j|^{\alpha}\Big).$$

Here, $\hat{p}$ denotes the Fourier transform of the function $p$.

(2) The function $p$ is ordinary smooth of order $\beta > 0$ if there exists a universal constant $c$ such that the following inequality holds for almost every $x \in \mathbb{R}^D$:

$$|\hat{p}(x)| \leq c \cdot \prod_{j=1}^{D} \frac{1}{1 + |x_j|^{\beta}}.$$

The notions of supersmoothness and ordinary smoothness have been used widely in deconvolution problems [25] and density estimation problems [17, 74, 29]. The supersmooth condition is satisfied when $p$ is a Gaussian or Cauchy distribution, while the ordinary smooth condition is satisfied when $p$ is a Laplace or Beta distribution.
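As a quick numerical illustration of Definition 3 (our own sketch, not from the paper): the Fourier transform of the standard Gaussian density is $e^{-t^2/2}$, matching the supersmooth condition with $\alpha = 2$, while that of the standard Laplace density is exactly $1/(1+t^2)$, matching the ordinary smooth condition with $\beta = 2$. The snippet below verifies both transforms by direct numerical integration:

```python
import numpy as np

# Fine grid for numerical Fourier transforms  \int p(x) e^{-i t x} dx
x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]

gauss = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # N(0, 1) density
laplace = 0.5 * np.exp(-np.abs(x))               # Laplace(0, 1) density

def ft_mag(p, t):
    """|Fourier transform| of the density p at frequency t (trapezoid rule)."""
    g = p * np.exp(-1j * t * x)
    return np.abs((g[0] + g[-1]) * 0.5 * dx + g[1:-1].sum() * dx)

for t in [0.0, 1.0, 2.0, 4.0]:
    # supersmooth decay: |p_hat(t)| = exp(-t^2 / 2)  (alpha = 2)
    assert abs(ft_mag(gauss, t) - np.exp(-t**2 / 2)) < 1e-4
    # ordinary smooth decay: |p_hat(t)| = 1 / (1 + t^2)  (beta = 2)
    assert abs(ft_mag(laplace, t) - 1.0 / (1 + t**2)) < 1e-4
```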

Based on the smoothness conditions in Definition 3, we have the following result regarding the mean integrated squared error (MISE) of the generalized Fourier density estimator (12) (see equation (13) for the definition of MISE).

###### Theorem 3.

(a) When $\phi$ is the sine function, namely in the setting of the classical Fourier integral theorem, the following holds:

• (Supersmooth setting) If the true density function $p$ is supersmooth of order $\alpha$ for some $\alpha > 0$, then there exist universal constants $\bar{C}_2$ and $\bar{C}_3$ such that, as long as $R$ is sufficiently large, we have

$$\mathrm{MISE}(p_{\phi_N, R}) \leq \bar{C}_2 \Big( R^{\max\{1-\alpha,\, 0\}} \exp\big(-\bar{C}_3 R^{\alpha}\big) + \frac{R^D}{N} \Big).$$

• (Ordinary smooth setting) If the true density function $p$ is ordinary smooth of order $\beta$ for some $\beta > 0$, then there exists a universal constant $\bar{c}$ such that

$$\mathrm{MISE}(p_{\phi_N, R}) \leq \bar{c} \Big( R^{-\beta+1} + \frac{R^D}{N} \Big).$$

(b) When $\phi$ is not the sine function, the following holds:

• (Supersmooth setting) If the true density function $p$ is supersmooth of order $\alpha$ for some $\alpha > 0$, then there exists a universal constant $C'_2$ such that, as long as $R$ is sufficiently large, we have

$$\mathrm{MISE}(p_{\phi_N, R}) \leq C'_2 \Big( \frac{1}{R^2} + \frac{R^D}{N} \Big).$$

• (Ordinary smooth setting) If the true density function $p$ is ordinary smooth of order $\beta$ for some $\beta > 0$, then there exists a universal constant $c'$ such that

$$\mathrm{MISE}(p_{\phi_N, R}) \leq c' \Big( \frac{1}{R^2} + \frac{R^D}{N} \Big).$$

The proof of Theorem 3 is in Appendix B.2. A few comments on the results of Theorem 3 are in order.

When $\phi$ is the sine function: As part (a) of Theorem 3 indicates, when the density $p$ is supersmooth, choosing the radius $R$ to balance the bias and the variance yields an optimal $R$ of order $(\log N)^{1/\alpha}$, and the MISE rate of the generalized Fourier density estimator becomes of order $(\log N)^{D/\alpha}/N$. It indicates that the MISE rate of $p_{\phi_N, R}$ is parametric, up to a logarithmic factor, when $p$ is supersmooth. On the other hand, when $p$ is ordinary smooth, the optimal $R$ becomes of order $N^{1/(\beta + D - 1)}$ and the MISE rate becomes of order $N^{-(\beta-1)/(\beta + D - 1)}$, which is slower than the MISE rate when $p$ is supersmooth.
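The optimal radius follows from balancing the two terms of the bounds in Theorem 3. For instance, for the ordinary smooth case of part (a), with bound $\bar{c}\,(R^{-\beta+1} + R^D/N)$, the balancing derivation is:

```latex
R^{-\beta+1} \asymp \frac{R^D}{N}
\;\Longrightarrow\;
R^{\beta - 1 + D} \asymp N
\;\Longrightarrow\;
R_{\mathrm{opt}} \asymp N^{1/(\beta + D - 1)},
\qquad
\mathrm{MISE}\big(p_{\phi_N, R_{\mathrm{opt}}}\big) \lesssim N^{-(\beta-1)/(\beta + D - 1)}.
```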

When $\phi$ is not the sine function: The results of part (b) of Theorem 3 demonstrate that the upper bounds on the MISE rate of the generalized Fourier density estimator are similar in the supersmooth and ordinary smooth settings. Balancing the two terms, the optimal radius is of order $N^{1/(D+2)}$ and the MISE rate of the estimator is of order $N^{-2/(D+2)}$.
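To see the density estimator in action, the snippet below implements the one-dimensional Fourier integral density estimator with the classical sine kernel, $p_R(k) = \frac{1}{N\pi}\sum_{i=1}^{N} \frac{\sin(R(k - X_i))}{k - X_i}$, and checks it at a single point against the true $\mathcal{N}(0,1)$ density (a supersmooth target). The normalization and the choice $R = 3$ are our illustrative assumptions:

```python
import numpy as np

def fourier_density_estimator(k, samples, R):
    """1-D Fourier integral density estimator with the sine kernel.
    sin(R u)/u is written as R * np.sinc(R u / pi) to handle u = 0."""
    u = k - samples
    return (R * np.sinc(R * u / np.pi)).sum() / (len(samples) * np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal(20000)        # i.i.d. samples from N(0, 1)

est = fourier_density_estimator(0.0, X, R=3.0)
true = 1.0 / np.sqrt(2 * np.pi)       # true N(0, 1) density at 0
assert abs(est - true) < 0.03         # small bias plus O(sqrt(R/N)) noise
```

Consistent with the supersmooth discussion above, even a moderate radius $R$ already gives a small truncation bias for a Gaussian target, so the error at this sample size is dominated by sampling noise.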

### A.2 Generalized Fourier nonparametric regression estimator

In this appendix, we provide an additional result on the mean squared error (MSE) rate of the generalized Fourier nonparametric regression estimator $f_{N,R}$ in equation (14) when $\phi$ is the sine function, namely, the setting of the classical Fourier integral theorem. The results for other choices of $\phi$ are left for future work.

In this setting, the MSE rate of $f_{N,R}$ was established in Theorem 9 of Ho et al. [29] when the function $p$ is supersmooth. Here, we restate that result for completeness.

###### Theorem 4.

Assume that the function $p$ is supersmooth of order $\alpha$ for some $\alpha > 0$. Furthermore, we assume that the function $f$ in the nonparametric regression model (3) satisfies

$$|\widehat{f \cdot p}(t)| \leq C_1\, Q(|t_1|, |t_2|, \ldots, |t_D|) \exp\Big(-C_2 \sum_{j=1}^{D} |t_j|^{\alpha}\Big),$$

where $\widehat{f \cdot p}$ is the Fourier transform of the function $f \cdot p$, $C_1$ and $C_2$ are some universal constants, and $Q$ is some polynomial function of $|t_1|, \ldots, |t_D|$ with non-negative coefficients. Then, we can find universal constants $C_4$ and $C_5$ such that, as long as $R$ is sufficiently large, we have

$$\mathbb{E}\big[(f_{N,R}(k) - f(k))^2\big] \leq C_4\, R^{\max\{2\deg(Q) + 2 - 2\alpha,\, 0\}} \exp\big(-2 C_2 R^{\alpha}\big) + \frac{(f(k) + C_5)\, R^D}{N\, p^2(k)}\, \bar{J}(R),$$

where $\deg(Q)$ denotes the degree of the polynomial function $Q$, and $\bar{J}(R)$ is defined as in Theorem 9 of Ho et al. [29].

The proof of Theorem 4 is similar to the proof of Theorem 9 of Ho et al. [29]; therefore, it is omitted. The result of Theorem 4 indicates that the optimal radius $R$ is of order $(\log N)^{1/\alpha}$ and that the resulting MSE rate of the generalized Fourier nonparametric regression estimator is parametric up to logarithmic factors.

## Appendix B Proofs

In this Appendix, we provide proofs for key results in the paper and in Appendix A.

### B.1 Proof of Theorem 1

Recall that $k_1, \ldots, k_N$ are i.i.d. samples from the density function $p$, and that the generalized Fourier density estimator $p_{\phi_N, R}$ of $p$ is given in equation (12). Taking the expectation of $p_{\phi_N, R}(k)$ and applying the change of variables $y = R(k - x)$, direct calculation demonstrates that

$$\mathbb{E}\big[p_{\phi_N, R}(k)\big] = \frac{1}{\pi^D} \int_{\mathbb{R}^D} \prod_{j=1}^{D} \frac{\phi(y_j)}{y_j}\, p\Big(k - \frac{y}{R}\Big)\, dy. \tag{16}$$

An application of the Taylor expansion up to the $m$-th order indicates that

$$p\Big(k - \frac{y}{R}\Big) = \sum_{0 \leq |\alpha| \leq m} \frac{1}{R^{|\alpha|}\, \alpha!} \prod_{j=1}^{D} (-y_j)^{\alpha_j}\, \frac{\partial^{|\alpha|} p}{\partial k^{\alpha}}(k) + \bar{R}(k, y), \tag{17}$$

where $\alpha = (\alpha_1, \ldots, \alpha_D)$, $|\alpha| = \sum_{j=1}^{D} \alpha_j$, $\alpha! = \prod_{j=1}^{D} \alpha_j!$, and $\bar{R}(k, y)$ is the Taylor remainder, admitting the following form:

$$\bar{R}(k, y) = \sum_{|\beta| = m+1} \frac{m+1}{R^{m+1}\, \beta!} \prod_{j=1}^{D} (-y_j)^{\beta_j} \int_{0}^{1} (1-t)^m\, \frac{\partial^{m+1} p}{\partial k^{\beta}}\Big(k - \frac{t y}{R}\Big)\, dt. \tag{18}$$
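As a sanity check on equations (17) and (18), the snippet below (our own numerical sketch, for $D = 1$, $m = 2$, and a standard Gaussian density $p$) verifies that the Taylor polynomial plus the integral-form remainder exactly reproduces $p(k - y/R)$:

```python
import math
import numpy as np

def p(x):
    """Standard Gaussian density."""
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# p and its first three derivatives (analytic, Hermite-polynomial factors)
derivs = [
    p,
    lambda x: -x * p(x),                # p'
    lambda x: (x**2 - 1) * p(x),        # p''
    lambda x: (3 * x - x**3) * p(x),    # p'''
]

k, y, R, m = 0.3, 1.2, 2.0, 2

# Taylor polynomial of p(k - y/R) up to order m, as in equation (17)
taylor = sum((-y)**a / (R**a * math.factorial(a)) * derivs[a](k)
             for a in range(m + 1))

# Integral-form remainder, as in equation (18) with |beta| = m + 1
t = np.linspace(0.0, 1.0, 20001)
g = (1 - t)**m * derivs[m + 1](k - t * y / R)
integral = ((g[0] + g[-1]) * 0.5 + g[1:-1].sum()) * (t[1] - t[0])
remainder = ((m + 1) / (R**(m + 1) * math.factorial(m + 1))
             * (-y)**(m + 1) * integral)

# Equation (17): exact value = Taylor polynomial + remainder
assert abs(p(k - y / R) - (taylor + remainder)) < 1e-7
```

For $D = 1$, the coefficient in equation (18) reduces to $(-y)^{m+1} / (R^{m+1}\, m!)$, which is the classical integral form of the Taylor remainder.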

Plugging equations (17) and (18) into equation (16), we find that