O(n) Connections are Expressive Enough: Universal Approximability of Sparse Transformers

06/08/2020
by   Chulhee Yun, et al.
5

Transformer networks use pairwise attention to compute contextual embeddings of inputs, and have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length n to compute attention in each layer. This has prompted recent research into faster attention models, with a predominant approach involving sparsifying the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How does the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. Our analysis proposes sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show the existence of models with only O(n) connections per attention layer that can approximate the same function class as the dense model with n^2 connections. Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/20/2019

Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the...
research
07/12/2021

Combiner: Full Attention Transformer with Sparse Computation Cost

Transformers provide a class of expressive architectures that are extrem...
research
10/07/2022

Breaking BERT: Evaluating and Optimizing Sparsified Attention

Transformers allow attention between all pairs of tokens, but there is r...
research
09/17/2020

Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation

Recent work on the lottery ticket hypothesis has produced highly sparse ...
research
02/16/2017

Addressing the Data Sparsity Issue in Neural AMR Parsing

Neural attention models have achieved great success in different NLP tas...
research
05/26/2022

Your Transformer May Not be as Powerful as You Expect

Relative Positional Encoding (RPE), which encodes the relative distance ...
research
05/27/2022

What Dense Graph Do You Need for Self-Attention?

Transformers have made progress in miscellaneous tasks, but suffer from ...

Please sign up or login with your details

Forgot password? Click here to reset