Transformer Quality in Linear Time

02/21/2022
by Weizhe Hua, et al.

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9× on Wiki-40B and 12.1× on PG-19 for auto-regressive language modeling, and 4.8× on C4 for masked language modeling.
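
To make the abstract's description more concrete, below is a minimal sketch of a gated attention unit (GAU) as characterized above: a gating (GLU-style) branch combined with a weak, single-head attention over a shared low-dimensional query/key basis. The specific dimension sizes, the SiLU activation, the squared-ReLU scoring function, and the per-dimension affine transforms are assumptions for illustration, not an exact reproduction of the paper's implementation.

```python
# Sketch of a gated attention unit (GAU): gating unit + weak single-head attention.
# Hyperparameters (d_model, expansion_factor, s) and activation choices are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionUnit(nn.Module):
    def __init__(self, d_model: int = 512, expansion_factor: int = 2, s: int = 128):
        super().__init__()
        e = d_model * expansion_factor
        self.u_proj = nn.Linear(d_model, e)   # gate branch
        self.v_proj = nn.Linear(d_model, e)   # value branch
        self.z_proj = nn.Linear(d_model, s)   # shared low-dim basis for queries/keys
        self.q_scale = nn.Parameter(torch.ones(s))
        self.q_bias = nn.Parameter(torch.zeros(s))
        self.k_scale = nn.Parameter(torch.ones(s))
        self.k_bias = nn.Parameter(torch.zeros(s))
        self.o_proj = nn.Linear(e, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        n = x.shape[1]
        u = F.silu(self.u_proj(x))            # gate activations
        v = F.silu(self.v_proj(x))            # values
        z = F.silu(self.z_proj(x))            # shared basis (single head)
        q = z * self.q_scale + self.q_bias    # cheap per-dim affine -> queries
        k = z * self.k_scale + self.k_bias    # cheap per-dim affine -> keys
        scores = torch.einsum("bns,bms->bnm", q, k) / n
        attn = F.relu(scores) ** 2            # squared-ReLU attention weights
        out = torch.einsum("bnm,bme->bne", attn, v)
        return self.o_proj(u * out)           # gate the attended values

if __name__ == "__main__":
    gau = GatedAttentionUnit()
    y = gau(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

The gating is what lets the attention be "weak" (single head, small query/key dimension) with little quality loss; the paper's complementary linear approximation then replaces the quadratic score matrix for long sequences, which this sketch does not attempt to reproduce.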


Related research

09/21/2022  Mega: Moving Average Equipped Gated Attention
The design choices in the Transformer attention mechanism, including wea...

12/28/2022  Hungry Hungry Hippos: Towards Language Modeling with State Space Models
State space models (SSMs) have demonstrated state-of-the-art sequence mo...

09/17/2021  Primer: Searching for Efficient Transformers for Language Modeling
Large Transformer models have been central to recent advances in natural...

06/21/2023  Iterated Piecewise Affine (IPA) Approximation for Language Modeling
In this work, we demonstrate the application of a simple first-order Tay...

03/17/2021  Value-aware Approximate Attention
Following the success of dot-product attention in Transformers, numerous...

03/16/2023  Jump to Conclusions: Short-Cutting Transformers With Linear Transformations
Transformer-based language models (LMs) create hidden representations of...

11/26/2019  Single Headed Attention RNN: Stop Thinking With Your Head
The leading approaches in language modeling are all obsessed with TV sho...
