Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences

10/21/2022
by Aosong Feng, et al.

Efficient Transformers have been developed for long-sequence modeling because of their subquadratic memory and time complexity. The Sparse Transformer is a popular approach to improving Transformer efficiency: it restricts self-attention to locations specified by predefined sparse patterns. However, leveraging sparsity can sacrifice expressiveness relative to full attention when important token correlations lie multiple hops away. To combine the efficiency of sparse Transformers with the expressiveness of the full-attention Transformer, we propose Diffuser, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between the corresponding disconnected tokens, in addition to attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling and investigate its ability to approximate full attention by analyzing the graph-expander property from a spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and the Long Range Arena (LRA). Evaluation results show that Diffuser achieves improvements by an average of 0.94% on text classification tasks and 2.30% on LRA compared to state-of-the-art benchmarks, which demonstrates the superior performance of Diffuser in both expressiveness and efficiency.
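The multi-hop idea in the abstract can be pictured with a short sketch: compute ordinary attention restricted to a sparse pattern, then diffuse the value states over that attention graph so that tokens not directly connected still interact through multi-hop paths. Below is a minimal PyTorch-style illustration assuming a personalized-PageRank-style propagation rule; the function name, tensor shapes, hop count, and restart weight alpha are illustrative assumptions, not the authors' implementation.

import torch

def attention_diffusion(Q, K, V, sparse_mask, num_hops=3, alpha=0.1):
    # Illustrative sketch only (hypothetical helper, not the paper's code).
    # Q, K, V: (batch, heads, seq_len, d_head); sparse_mask: (seq_len, seq_len)
    # boolean pattern, True where attention is allowed (assumed to include the
    # diagonal so every row of the attention matrix is well defined).
    d = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d ** 0.5
    scores = scores.masked_fill(~sparse_mask, float("-inf"))
    A = torch.softmax(scores, dim=-1)  # one-hop attention on the sparse graph

    # Personalized-PageRank-style diffusion: repeatedly propagating the value
    # states through A lets information flow along multi-hop paths of the
    # sparse graph without materializing the dense powers A^k.
    Z = V
    for _ in range(num_hops):
        Z = (1.0 - alpha) * torch.matmul(A, Z) + alpha * V
    return Z

In practice the attention matrix would be stored and applied in sparse form, so each propagation step costs time proportional to the number of retained attention edges rather than seq_len squared; the dense matmul above is only for clarity.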


Related research:

07/12/2021 · Combiner: Full Attention Transformer with Sparse Computation Cost
Transformers provide a class of expressive architectures that are extrem...

03/01/2021 · OmniNet: Omnidirectional Representations from Transformers
This paper proposes Omnidirectional Representations from Transformers (O...

04/02/2021 · TFill: Image Completion via a Transformer-Based Architecture
Bridging distant context interactions is important for high quality imag...

08/05/2021 · FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention
We propose FMMformers, a class of efficient and flexible transformers in...

03/29/2022 · Efficient Localness Transformer for Smart Sensor-Based Energy Disaggregation
Modern smart sensor-based energy management systems leverage non-intrusi...

03/15/2022 · Long Document Summarization with Top-down and Bottom-up Inference
Text summarization aims to condense long documents and retain key inform...

10/07/2022 · Breaking BERT: Evaluating and Optimizing Sparsified Attention
Transformers allow attention between all pairs of tokens, but there is r...
