Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost

10/27/2022
by Sungjun Cho, et al.

To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention with a hand-crafted pattern and 2) full attention followed by a sparse variant of softmax such as α-entmax. Unfortunately, the first group lacks adaptability to data, while the second still requires quadratic cost during training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Each attention head then data-adaptively samples a bipartite graph, whose adjacency matrix is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to propagate gradients through the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By analyzing the distribution over sampled graphs, we theoretically show that SBM-Transformer is, in expectation, a universal approximator of arbitrary sequence-to-sequence functions. Empirical evaluations on the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation is available at https://github.com/sc782/SBM-Transformer .
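To make the mechanism concrete, below is a minimal PyTorch sketch of a single SBM-masked attention head. It assumes a simple mixed-membership parameterization (softmax memberships, a learnable block-affinity matrix, and a sigmoid edge probability); the class name, `num_blocks`, and the dense sampling are illustrative choices, not the authors' implementation. In particular, this sketch materializes the full N×N mask and is therefore O(N²), whereas the paper's cost is linear in the number of sampled edges.

```python
# Hedged sketch of SBM-style masked attention with a straight-through estimator.
# Names such as SBMAttentionHead, num_blocks, and the sigmoid edge-probability
# model are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SBMAttentionHead(nn.Module):
    def __init__(self, dim, num_blocks=8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Membership projections map tokens to mixed block memberships.
        self.q_member = nn.Linear(dim, num_blocks)
        self.k_member = nn.Linear(dim, num_blocks)
        # Learnable block-affinity matrix of the SBM (one per head).
        self.block = nn.Parameter(torch.randn(num_blocks, num_blocks))

    def forward(self, x):
        # x: (batch, seq_len, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Mixed memberships over blocks for queries (rows) and keys (columns).
        zq = F.softmax(self.q_member(x), dim=-1)  # (B, N, K)
        zk = F.softmax(self.k_member(x), dim=-1)  # (B, N, K)

        # Edge probabilities of the bipartite graph: p_ij ~ zq_i^T B zk_j.
        probs = torch.sigmoid(zq @ self.block @ zk.transpose(-1, -2))  # (B, N, N)

        # Sample a binary attention mask; the straight-through trick keeps the
        # forward value discrete while routing gradients to `probs`.
        mask = torch.bernoulli(probs)
        mask = mask + probs - probs.detach()

        # Dense masked attention (the linear-cost version would only touch
        # the sampled edges instead of the full score matrix).
        scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
        scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        # Rows with no sampled edges yield NaNs after softmax; zero them out.
        attn = torch.nan_to_num(attn, nan=0.0)
        return (attn * mask) @ v
```

Because the sampled mask multiplies the attention weights in the output, the gradient of the loss with respect to each edge probability is nonzero, which is what lets the model adjust sparsity per head and per input.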


Related research

- 11/18/2021 · You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling
- 08/30/2019 · Adaptively Sparse Transformers
- 09/24/2021 · Predicting Attention Sparsity in Transformers
- 08/01/2023 · FLatten Transformer: Vision Transformer using Focused Linear Attention
- 05/27/2022 · What Dense Graph Do You Need for Self-Attention?
- 02/28/2023 · Sampled Transformer for Point Sets
- 12/21/2020 · Sub-Linear Memory: How to Make Performers SLiM
