Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

10/05/2021
by Narsimha Chilkuri, et al.

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their performance hinges on processing large amounts of data, and their computational and memory requirements grow quadratically with sequence length. Motivated by these considerations, we construct a Legendre Memory Unit (LMU) based model that introduces a general prior for sequence processing and exhibits O(n) and O(n ln n) (or better) dependencies for memory and computation, respectively. Across three orders of magnitude of model size, we show that our new architecture attains the same accuracy as transformers with 10x fewer tokens. We also show that, for the same amount of training, our model improves the loss over transformers about as much as transformers improve over LSTMs. Additionally, we demonstrate that adding global self-attention complements our architecture, and that the augmented model improves performance even further.
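To make the abstract's complexity claims concrete, the following is a minimal NumPy/SciPy sketch of the LMU's fixed state-space memory, assuming the standard (A, B) matrices from Voelker et al. (2019) and a zero-order-hold discretization. The function names and the stand-alone recurrence here are illustrative assumptions, not the authors' exact implementation; the paper's full model wraps this memory in additional learned projections and nonlinearities that are not shown.

```python
import numpy as np
from scipy.signal import cont2discrete

def lmu_matrices(order: int, theta: float, dt: float = 1.0):
    """Discretized (A_d, B_d) of the Legendre delay system (Voelker et al., 2019),
    whose `order`-dimensional state approximates a sliding window of `theta` steps."""
    q = np.arange(order, dtype=np.float64)
    r = (2 * q + 1)[:, None] / theta
    i, j = np.meshgrid(q, q, indexing="ij")
    A = np.where(i < j, -1.0, (-1.0) ** (i - j + 1)) * r
    B = ((-1.0) ** q)[:, None] * r
    # Zero-order-hold discretization; these matrices are fixed, not learned.
    A_d, B_d, *_ = cont2discrete(
        (A, B, np.eye(order), np.zeros((order, 1))), dt, method="zoh"
    )
    return A_d, B_d

def lmu_memory(u, order=32, theta=64.0):
    """Apply the recurrence m_t = A_d m_{t-1} + B_d u_t to a 1-D sequence `u`,
    returning the (len(u), order) memory trajectory. Because (A_d, B_d) are
    input-independent, the same map can instead be evaluated in parallel as a
    convolution (e.g. via an FFT), which is the source of the O(n ln n)
    computation and O(n) memory dependencies cited in the abstract."""
    A_d, B_d = lmu_matrices(order, theta)
    m = np.zeros(order)
    out = np.empty((len(u), order))
    for t, u_t in enumerate(u):
        m = A_d @ m + B_d[:, 0] * u_t
        out[t] = m
    return out

# Example: compress a 256-step signal into a 32-dimensional Legendre memory.
signal = np.sin(np.linspace(0, 8 * np.pi, 256))
print(lmu_memory(signal, order=32, theta=64.0).shape)  # (256, 32)
```

Because the memory matrices are input-independent, the sequential loop above is only one way to evaluate the map; swapping it for an FFT-based convolution trades the O(n) step-by-step recurrence for a parallelizable O(n ln n) computation, which is the design choice the abstract highlights.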


