FLuRKA: Fast fused Low-Rank Kernel Attention

06/27/2023
by Ahan Gupta, et al.

Many efficient approximate self-attention techniques have emerged since the inception of the transformer architecture, two popular classes of which are low-rank and kernel methods. Each class has its own strengths, and we observe that these strengths complement each other. We exploit this complementarity to fuse low-rank and kernel methods, producing a new class of transformers: FLuRKA (Fast Low-Rank and Kernel Attention). FLuRKA deliver sizable runtime gains over the underlying approximate techniques while retaining high model quality. We evaluate both the runtime and the quality of FLuRKA theoretically and empirically: our runtime analysis identifies a range of parameter configurations in which FLuRKA exhibit speedups, and our accuracy analysis bounds the error of FLuRKA with respect to full attention. We instantiate three FLuRKA variants, which achieve empirical speedups of up to 3.3x over low-rank methods and 1.7x over kernel methods, translating to speedups of up to 30x over models with full attention. With respect to model quality, FLuRKA match the accuracy of low-rank and kernel methods on GLUE after pre-training on WikiText-103, and when pre-training under a fixed time budget, FLuRKA yield better perplexity scores than models with full attention.
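To make the fusion idea concrete, below is a minimal, hypothetical sketch of how a low-rank method and a kernel method can be combined in a single attention step. It assumes a Linformer-style projection that shrinks the sequence dimension of the keys and values, followed by a linear-attention-style feature map (elu(x)+1) in place of the softmax; the projection matrices `E` and `F`, the function name `flurka_attention`, and the specific feature map are illustrative choices, not the paper's exact formulation.

```python
import torch


def flurka_attention(q, k, v, E, F, eps=1e-6):
    """Hypothetical sketch of fused low-rank + kernel attention.

    q, k, v: (batch, seq_len, d) query/key/value tensors.
    E, F:    (proj_len, seq_len) Linformer-style projections that
             shrink the sequence dimension of K and V (learned in
             practice; illustrative here).
    The softmax is replaced by an elu(x)+1 feature map, so attention
    is computed without materializing a (seq_len x seq_len) matrix.
    """
    # Low-rank step: project keys and values along the sequence axis.
    k_low = torch.einsum('ps,bsd->bpd', E, k)   # (batch, proj_len, d)
    v_low = torch.einsum('ps,bsd->bpd', F, v)   # (batch, proj_len, d)

    # Kernel step: apply the feature map in place of softmax.
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q_f, k_f = phi(q), phi(k_low)

    # Linear attention: associativity lets us contract K with V first.
    kv = torch.einsum('bpd,bpe->bde', k_f, v_low)            # (batch, d, d)
    z = 1.0 / (torch.einsum('bsd,bd->bs', q_f, k_f.sum(dim=1)) + eps)
    return torch.einsum('bsd,bde,bs->bse', q_f, kv, z)


if __name__ == "__main__":
    B, N, P, D = 2, 128, 32, 64          # batch, seq len, projected len, head dim
    q, k, v = (torch.randn(B, N, D) for _ in range(3))
    E = torch.randn(P, N) / N ** 0.5     # random stand-ins for learned projections
    F = torch.randn(P, N) / N ** 0.5
    out = flurka_attention(q, k, v, E, F)
    print(out.shape)                     # torch.Size([2, 128, 64])
```

The intended synergy is visible in the shapes: the low-rank projection reduces the cost of the kernelized key-value contraction from O(N·d²) to O(P·d²) terms over the shortened sequence, while the feature map keeps the overall computation linear in the input length.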

