Blockwise Parallel Transformer for Long Context Large Models

05/30/2023
by Hao Liu, et al.

Transformers have emerged as the cornerstone of state-of-the-art natural language processing models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving multiple long sequences or long-term dependencies. We present a distinct approach, Blockwise Parallel Transformer (BPT), that leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs. By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods. Extensive experiments on language modeling and reinforcement learning tasks demonstrate the effectiveness of BPT in reducing memory requirements and improving performance.
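The core idea can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the query sequence is split into blocks, attention over key/value blocks is accumulated with a running, numerically stable softmax, and the feedforward network is applied to each block's output inside the same loop, so peak memory scales with the block size rather than the full sequence length. The names `bpt_layer` and `feedforward`, the block sizes, and the single-head, unmasked setting are illustrative assumptions; layer normalization, multi-head projections, and causal masking are omitted for brevity.

```python
import jax
import jax.numpy as jnp


def feedforward(x, w1, w2):
    # Two-layer FFN applied to a single block of activations
    # (w1, w2 are illustrative parameters; biases omitted).
    return jax.nn.gelu(x @ w1) @ w2


def bpt_layer(q, k, v, w1, w2, q_block=512, kv_block=512):
    # One blockwise-parallel layer: attention and the FFN are computed block by
    # block, so no (seq_len x seq_len) score matrix or full-length FFN
    # activation is ever materialized.
    seq_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    outputs = []
    for qs in range(0, seq_len, q_block):
        qb = q[qs:qs + q_block]
        # Running accumulators for a numerically stable softmax over kv blocks.
        num = jnp.zeros((qb.shape[0], d))         # weighted sum of values
        den = jnp.zeros((qb.shape[0], 1))         # softmax normalizer
        m = jnp.full((qb.shape[0], 1), -jnp.inf)  # running max score
        for ks in range(0, seq_len, kv_block):
            kb = k[ks:ks + kv_block]
            vb = v[ks:ks + kv_block]
            s = (qb @ kb.T) * scale               # one block of attention scores
            m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
            p = jnp.exp(s - m_new)
            corr = jnp.exp(m - m_new)             # rescale previous accumulators
            num = num * corr + p @ vb
            den = den * corr + p.sum(axis=-1, keepdims=True)
            m = m_new
        attn = num / den
        # The FFN (with residual connections) is fused into the same blockwise
        # loop, so its intermediate activations also stay at block size.
        h = attn + qb
        outputs.append(h + feedforward(h, w1, w2))
    return jnp.concatenate(outputs, axis=0)
```

In a practical JAX implementation the Python loops would typically be expressed with jax.lax.scan together with gradient checkpointing, so that the blockwise structure also bounds memory during the backward pass; that is what makes training on sequences far longer than a vanilla Transformer feasible.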

Related research

02/07/2021  Nyströmformer: A Nyström-Based Algorithm for Approximating Self-Attention
Transformers have emerged as a powerful tool for a broad range of natura...

11/10/2019  Improving Transformer Models by Reordering their Sublayers
Multilayer transformer networks consist of interleaved self-attention an...

01/09/2019  Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Transformer networks have a potential of learning longer-term dependency...

03/17/2023  CoLT5: Faster Long-Range Transformers with Conditional Computation
Many natural language processing tasks benefit from long inputs, but pro...

08/25/2023  Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers
Although dominant in natural language processing, transformer-based mode...

01/23/2023  AttMEMO: Accelerating Transformers with Memoization on Big Memory Systems
Transformer models gain popularity because of their superior inference a...

10/06/2021  Ripple Attention for Visual Perception with Sub-quadratic Complexity
Transformer architectures are now central to modeling in natural languag...
