Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

05/24/2023
by Ziwei He, et al.

The transformer model is known to be computationally demanding and prohibitively costly for long sequences, since the self-attention module has quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation; however, a large portion of these approaches prevents the model from inheriting weights from large pretrained models. In this work, we address the transformer's inefficiency from another perspective. We propose the Fourier Transformer, a simple yet effective approach that progressively removes redundancy in the hidden sequence by using the ready-made Fast Fourier Transform (FFT) operator to perform the Discrete Cosine Transform (DCT). The Fourier Transformer significantly reduces computational costs while retaining the ability to inherit weights from various large pretrained models. Experiments show that our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA, with significant improvements in both speed and space. For generative sequence-to-sequence tasks, including CNN/DailyMail and ELI5, our model inherits the BART weights and outperforms standard BART as well as other efficient models. Our code is publicly available at <https://github.com/LUMIA-Group/FourierTransformer>.
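The core operation described in the abstract, shortening the hidden sequence with a DCT computed through the FFT and discarding high-frequency coefficients, can be sketched in a few lines of PyTorch. The sketch below is illustrative only: the function names (`dct_ii_via_fft`, `dct_downsample`), the keep ratio, and the choice to truncate along the sequence axis are assumptions for exposition, not the paper's exact procedure; see the linked repository for the actual implementation.

```python
import math
import torch


def dct_ii_via_fft(x: torch.Tensor) -> torch.Tensor:
    """Unnormalized DCT-II along the last dimension, computed with the FFT
    via Makhoul's even/odd reordering trick."""
    n = x.shape[-1]
    # Reorder: even-indexed samples in order, then odd-indexed samples reversed.
    v = torch.cat([x[..., 0::2], x[..., 1::2].flip(-1)], dim=-1)
    V = torch.fft.fft(v, dim=-1)
    # The phase factor e^{-i*pi*k/(2N)} turns the FFT of the reordered signal into the DCT.
    k = torch.arange(n, device=x.device, dtype=x.dtype)
    phase = torch.polar(torch.ones_like(k), -math.pi * k / (2 * n))
    return 2.0 * (V * phase).real


def dct_downsample(hidden: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Shorten a hidden sequence (batch, seq_len, d_model) by keeping only the
    lowest-frequency DCT coefficients along the sequence axis.

    NOTE: keep_ratio and this truncation scheme are illustrative assumptions,
    not the exact recipe used in the paper."""
    x = hidden.transpose(1, 2)                 # (batch, d_model, seq_len)
    coeffs = dct_ii_via_fft(x)                 # DCT over the sequence axis
    keep = max(1, int(coeffs.shape[-1] * keep_ratio))
    return coeffs[..., :keep].transpose(1, 2)  # (batch, keep, d_model)


if __name__ == "__main__":
    hidden = torch.randn(2, 512, 768)          # hypothetical encoder hidden states
    shorter = dct_downsample(hidden, keep_ratio=0.25)
    print(shorter.shape)                       # torch.Size([2, 128, 768])
```

With a keep ratio of 0.25, the next layer sees a sequence a quarter of the original length, which is where the savings over quadratic-cost self-attention would come from in such a scheme.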


Related research

11/18/2021
You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
Transformer-based models are widely used in natural language processing ...

05/03/2023
A Lightweight CNN-Transformer Model for Learning Traveling Salesman Problems
Transformer-based models show state-of-the-art performance even for larg...

05/31/2023
Recasting Self-Attention with Holographic Reduced Representations
In recent years, self-attention has become the dominant paradigm for seq...

05/09/2021
FNet: Mixing Tokens with Fourier Transforms
We show that Transformer encoder architectures can be massively sped up,...

03/02/2022
DCT-Former: Efficient Self-Attention with Discrete Cosine Transform
Since their introduction the Transformer architectures emerged as the dom...

05/02/2023
Unlimiformer: Long-Range Transformers with Unlimited Length Input
Transformer-based models typically have a predefined bound to their inpu...

07/22/2021
FNetAR: Mixing Tokens with Autoregressive Fourier Transforms
In this note we examine the autoregressive generalization of the FNet al...
