Efficient Content-Based Sparse Attention with Routing Transformers

03/12/2020
by   Aurko Roy, et al.
3

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches to reduce this complexity focused on attending to local sliding windows or a small set of locations independent of content. Our work proposes to learn dynamic sparse attention patterns that avoid allocating computation and memory to attend to content unrelated to the query of interest. This work builds upon two lines of research: it combines the modeling flexibility of prior work on content-based sparse attention with the efficiency gains from approaches based on local, temporal sparse attention. Our model, the Routing Transformer, endows self-attention with a sparse routing module based on online k-means while reducing the overall complexity of attention to O(n^1.5d) from O(n^2d) for sequence length n and hidden dimension d. We show that our model outperforms comparable sparse attention models on language modeling on Wikitext-103 (15.8 vs 18.3 perplexity) as well as on image generation on ImageNet-64 (3.43 vs 3.44 bits/dim) while using fewer self-attention layers.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/28/2021

Linear-Time Self Attention with Codeword Histogram for Efficient Recommendation

Self-attention has become increasingly popular in a variety of sequence ...
research
02/26/2020

Sparse Sinkhorn Attention

We propose Sparse Sinkhorn Attention, a new efficient and sparse method ...
research
03/22/2020

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

While the self-attention mechanism has been widely used in a wide variet...
research
12/10/2021

Self-attention Does Not Need O(n^2) Memory

We present a very simple algorithm for attention that requires O(1) memo...
research
09/01/2022

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

As its core computation, a self-attention mechanism gauges pairwise corr...
research
03/29/2022

Efficient Localness Transformer for Smart Sensor-Based Energy Disaggregation

Modern smart sensor-based energy management systems leverage non-intrusi...
research
11/29/2022

Lightweight Structure-Aware Attention for Visual Understanding

Vision Transformers (ViTs) have become a dominant paradigm for visual re...

Please sign up or login with your details

Forgot password? Click here to reset