MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

05/12/2023
by Lili Yu, et al.

Autoregressive transformers are spectacular models for short sequences but scale poorly to long sequences such as high-resolution images, podcasts, code, or books. We propose Megabyte, a multiscale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes. Megabyte segments sequences into patches and uses a local submodel within patches and a global model between patches. This enables sub-quadratic self-attention, much larger feedforward layers for the same compute, and improved parallelism during decoding, unlocking better performance at reduced cost for both training and generation. Extensive experiments show that Megabyte allows byte-level models to perform competitively with subword models on long-context language modeling, achieve state-of-the-art density estimation on ImageNet, and model audio from raw files. Together, these results establish the viability of tokenization-free autoregressive sequence modeling at scale.
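
The patch-based global/local split described in the abstract can be pictured with a minimal sketch. The PyTorch code below is an illustrative assumption of how such a multiscale decoder fits together, not the authors' released implementation: module names, dimensions, and the use of stock nn.TransformerEncoder layers with causal masks are placeholders, and details such as the paper's padding embeddings and cross-patch conditioning are simplified.

```python
import torch
import torch.nn as nn


def causal_mask(n: int) -> torch.Tensor:
    # Additive attention mask: position i may not attend to positions > i.
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)


class MultiscaleByteDecoder(nn.Module):
    """Toy global/local decoder over byte patches (illustrative only)."""

    def __init__(self, vocab=256, patch=8, d_global=512, d_local=128,
                 global_layers=4, local_layers=2, heads=8):
        super().__init__()
        self.patch = patch
        self.byte_embed = nn.Embedding(vocab, d_local)
        # A patch representation is the concatenation of its byte embeddings.
        self.patch_proj = nn.Linear(patch * d_local, d_global)
        self.global_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_global, heads, batch_first=True),
            num_layers=global_layers)
        self.global_to_local = nn.Linear(d_global, d_local)
        self.local_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_local, heads, batch_first=True),
            num_layers=local_layers)
        self.to_logits = nn.Linear(d_local, vocab)

    def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
        # byte_ids: (batch, seq_len), seq_len divisible by the patch size.
        b, t = byte_ids.shape
        k = t // self.patch
        x = self.byte_embed(byte_ids)                       # (b, t, d_local)
        # Global model: causal attention *between* patch representations.
        g = self.patch_proj(x.view(b, k, -1))               # (b, k, d_global)
        g = self.global_model(g, mask=causal_mask(k))
        # Shift right so the local model for patch j only sees patches < j.
        g = torch.roll(g, shifts=1, dims=1)
        g[:, 0] = 0.0
        # Local model: causal attention *within* each patch, with byte inputs
        # shifted by one so position p is predicted from bytes < p.
        l = torch.roll(x.view(b, k, self.patch, -1), shifts=1, dims=2)
        l[:, :, 0] = 0.0
        l = l.reshape(b * k, self.patch, -1) \
            + self.global_to_local(g).reshape(b * k, 1, -1)
        l = self.local_model(l, mask=causal_mask(self.patch))
        return self.to_logits(l).reshape(b, t, -1)          # next-byte logits


# Usage: next-byte logits for one sequence of 64 random bytes.
model = MultiscaleByteDecoder()
print(model(torch.randint(0, 256, (1, 64))).shape)  # torch.Size([1, 64, 256])
```

The design choice behind the sub-quadratic claim: for a sequence of T bytes split into patches of size P, the global model attends over T/P patch representations and each local model over P bytes, so attention cost grows roughly as (T/P)^2 + T*P instead of T^2, which is what makes million-byte contexts tractable in this sketch.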

