DCT-Former: Efficient Self-Attention with Discrete Cosine Transform

by   Carmelo Scribano, et al.

Since their introduction the Trasformer architectures emerged as the dominating architectures for both natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive" architectures arises from the computation of the dot-product attention, which grows both in memory consumption and number of operations as O(n^2) where n stands for the input sequence length, thus limiting the applications that require modeling very long sequences. Several approaches have been proposed so far in the literature to mitigate this issue, with varying degrees of success. Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module by leveraging the properties of the Discrete Cosine Transform. An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time. This makes it particularly suitable in real-time contexts on embedded platforms. Moreover, we assume that the results of our research might serve as a starting point for a broader family of deep neural models with reduced memory footprint. The implementation will be made publicly available at https://github.com/cscribano/DCT-Former-Public


page 1

page 2

page 3

page 4


Self-attention Does Not Need O(n^2) Memory

We present a very simple algorithm for attention that requires O(1) memo...

Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

The transformer model is known to be computationally demanding, and proh...

Linear-Time Self Attention with Codeword Histogram for Efficient Recommendation

Self-attention has become increasingly popular in a variety of sequence ...

Multi Resolution Analysis (MRA) for Approximate Self-Attention

Transformers have emerged as a preferred model for many tasks in natural...

An Overview of Neural Network Compression

Overparameterized networks trained to convergence have shown impressive ...

VanillaNet: the Power of Minimalism in Deep Learning

At the heart of foundation models is the philosophy of "more is differen...

Please sign up or login with your details

Forgot password? Click here to reset