Combiner: Full Attention Transformer with Sparse Computation Cost

by Hongyu Ren et al.

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity š’Ŗ(L^2) with respect to the sequence length L in attention layers, which restricts their application to extremely long sequences. Most existing approaches leverage sparsity or low-rank assumptions in the attention matrix to reduce cost, but sacrifice expressiveness. Instead, we propose Combiner, which provides full attention capability in each attention head while maintaining low computation and memory complexity. The key idea is to treat the self-attention mechanism as a conditional expectation over embeddings at each location, and to approximate the conditional distribution with a structured factorization. Each location can attend to all other locations, either via direct attention or through indirect attention to abstractions, which are again conditional expectations of embeddings from the corresponding local regions. We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention, resulting in the same sub-quadratic cost (š’Ŗ(L log L) or š’Ŗ(Lāˆš L)). Combiner is a drop-in replacement for the attention layers in existing transformers and can be easily implemented in common frameworks. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach, yielding state-of-the-art results on several image and text modeling tasks.
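The "attention as conditional expectation" view, and the two-level direct/indirect factorization it motivates, can be illustrated with a minimal numpy sketch. This is a hypothetical toy (the pooled per-block abstractions and the block layout are illustrative assumptions, not the paper's exact parameterization): each query attends directly to keys in its own block and indirectly to one summary per other block, so the per-query cost drops from O(L) to roughly O(āˆšL) when the block size is āˆšL.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard attention as a conditional expectation:
    # o_i = E_{j ~ p(j|i)}[v_j], with p(j|i) = softmax_j(q_i . k_j / sqrt(d)).
    p = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return p @ V

def factorized_attention(Q, K, V, block=4):
    # Illustrative Combiner-style factorization (an assumption, not the
    # paper's exact design): direct attention within the query's own block,
    # indirect attention to one pooled abstraction per other block.
    L, d = Q.shape
    nb = L // block
    Kb = K.reshape(nb, block, d)
    Vb = V.reshape(nb, block, d)
    # Abstractions: one summary key/value per block (simple mean pooling here).
    K_abs = Kb.mean(axis=1)  # (nb, d)
    V_abs = Vb.mean(axis=1)  # (nb, d)
    out = np.zeros_like(V)
    for i in range(L):
        b = i // block
        other = [j for j in range(nb) if j != b]
        # Candidate set: own block directly + abstractions of other blocks,
        # i.e. block + nb - 1 candidates instead of L.
        cand_k = np.vstack([Kb[b], K_abs[other]])
        cand_v = np.vstack([Vb[b], V_abs[other]])
        p = softmax(Q[i] @ cand_k.T / np.sqrt(d))
        out[i] = p @ cand_v  # conditional expectation under factorized p(j|i)
    return out
```

Note that when a single block covers the whole sequence, the factorized distribution has no abstractions left and the sketch reduces exactly to full attention; shrinking the block size trades a coarser approximation of p(j|i) for a smaller candidate set per query.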

Related Research

Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences (10/21/2022)
Generating Long Sequences with Sparse Transformers (04/23/2019)
O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers (06/08/2020)
Linear Self-Attention Approximation via Trainable Feedforward Kernel (11/08/2022)
Masked Language Modeling for Proteins via Linearly Scalable Long-Context Transformers (06/05/2020)
Adaptive Transformers for Learning Multimodal Representations (05/15/2020)
KDEformer: Accelerating Transformers via Kernel Density Estimation (02/05/2023)
