Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

08/01/2022
by Tan Nguyen et al.

Transformers have achieved remarkable success in sequence modeling and beyond, but they suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Efficient transformers, which leverage techniques such as sparse attention, linear attention, and hashing tricks, have been proposed to reduce this quadratic complexity, but they often significantly degrade accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the momentum transformer, which uses momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy that computes the momentum value for our model from the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for an optimal momentum value and further improves the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrates that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.
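
To make the idea concrete, below is a minimal sketch, not the authors' exact formulation, of how a residual connection read as a gradient-descent step can be augmented with a heavy-ball momentum state, and how a momentum value could be derived from the classical optimal heavy-ball momentum for a quadratic objective. The function names `momentum_residual_step` and `heavy_ball_momentum`, the step size `gamma`, and the curvature bounds `mu` and `L` are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: momentum added to a residual/linear-attention update,
# plus the heavy-ball momentum formula for a quadratic objective.
import math
import torch


def momentum_residual_step(x, f_x, v, beta=0.9, gamma=1.0):
    """One momentum-augmented residual update (illustrative).

    Standard residual: x_next = x + f_x        (viewed as one gradient step)
    With momentum:     v_next = beta * v + f_x
                       x_next = x + gamma * v_next
    """
    v_next = beta * v + f_x
    x_next = x + gamma * v_next
    return x_next, v_next


def heavy_ball_momentum(mu, L):
    """Polyak's optimal heavy-ball momentum for a quadratic whose curvature
    (Hessian eigenvalues) lies in [mu, L]:
        beta* = ((sqrt(L) - sqrt(mu)) / (sqrt(L) + sqrt(mu)))**2
    An adaptive scheme could estimate mu and L on the fly instead of tuning beta.
    """
    return ((math.sqrt(L) - math.sqrt(mu)) / (math.sqrt(L) + math.sqrt(mu))) ** 2


if __name__ == "__main__":
    # Toy usage: a few momentum-augmented updates with a fixed linear map
    # standing in for the attention/feature map F(x).
    torch.manual_seed(0)
    layer = torch.nn.Linear(16, 16)
    x = torch.randn(4, 16)
    v = torch.zeros_like(x)
    beta = heavy_ball_momentum(mu=0.1, L=1.0)  # assumed curvature bounds
    for _ in range(3):
        x, v = momentum_residual_step(x, layer(x), v, beta=beta)
    print(x.shape, round(beta, 4))
```

Because the momentum state `v` has the same shape as the hidden representation, this modification keeps the linear memory and computational costs of linear attention; the extra work per layer is a scalar-weighted addition.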

Related research:
- 06/29/2020: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
- 02/23/2022: FastRPB: a Scalable Relative Positional Encoding for Long Sequence Tasks
- 05/27/2022: What Dense Graph Do You Need for Self-Attention?
- 10/13/2021: How Does Momentum Benefit Deep Neural Networks Architecture Design? A Few Case Studies
- 02/02/2023: On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance
- 11/25/2021: Wake Word Detection with Streaming Transformers
- 08/05/2021: FMMformer: Efficient and Flexible Transformer via Decomposed Near-field and Far-field Attention
