Sharing Attention Weights for Fast Transformer

06/26/2019
by Tong Xiao et al.

Recently, the Transformer machine translation system has shown strong results by stacking attention layers on both the source- and target-language sides. However, inference with this model is slow due to the heavy use of dot-product attention in auto-regressive decoding. In this paper we speed up the Transformer via a fast and lightweight attention model. More specifically, we share attention weights between adjacent layers, enabling efficient vertical re-use of hidden states. Moreover, the sharing policy can be learned jointly with the MT model. We test our approach on ten WMT and NIST OpenMT tasks. Experimental results show that it yields an average 1.3X speed-up (with almost no decrease in BLEU) on top of a state-of-the-art implementation that already uses a cache for fast inference. Our approach also obtains a 1.8X speed-up when combined with the AAN (average attention network) model, which is 16 times faster than the baseline that does not use the attention cache.
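To make the core idea concrete, below is a minimal sketch of attention-weight sharing between adjacent layers, written in PyTorch. It is an illustration under simplifying assumptions, not the paper's released implementation: the class and parameter names (SharedAttention, SharedDecoderStack, share_mask) are hypothetical, attention is single-head without masking or residual connections, and the learned sharing policy described in the abstract is replaced by a fixed boolean mask. A layer marked as "shared" skips its own query/key projections and dot-product, reusing the attention weights computed by the layer directly below it.

```python
# Minimal sketch (assumptions: PyTorch; illustrative names; single-head
# attention; fixed sharing mask instead of the paper's learned policy).
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedAttention(nn.Module):
    """Self-attention layer that can reuse attention weights from a lower layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x, shared_weights=None):
        # x: (batch, seq_len, d_model)
        v = self.wv(x)
        if shared_weights is None:
            # Regular scaled dot-product attention weights.
            q, k = self.wq(x), self.wk(x)
            weights = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        else:
            # Reuse the weights of the adjacent (lower) layer, skipping the
            # Q/K projections and the dot-product entirely.
            weights = shared_weights
        return weights @ v, weights


class SharedDecoderStack(nn.Module):
    """Stack of attention layers; share_mask[i] = True means layer i reuses
    the attention weights computed by layer i-1."""

    def __init__(self, d_model: int, num_layers: int, share_mask):
        super().__init__()
        self.layers = nn.ModuleList(
            [SharedAttention(d_model) for _ in range(num_layers)]
        )
        self.share_mask = share_mask

    def forward(self, x):
        prev_weights = None
        for layer, share in zip(self.layers, self.share_mask):
            x, prev_weights = layer(x, prev_weights if share else None)
        return x


# Usage: layers 1 and 3 reuse the weights of layers 0 and 2, respectively.
stack = SharedDecoderStack(d_model=64, num_layers=4,
                           share_mask=[False, True, False, True])
out = stack(torch.randn(2, 10, 64))  # (batch=2, seq_len=10, d_model=64)
```

In this sketch every "shared" layer saves one pair of Q/K projections and one softmax over the attention matrix, which is where the speed-up at inference time comes from; the paper additionally learns which layers should share rather than fixing the mask by hand.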

Related research

- Accelerating Neural Transformer via an Average Attention Network (05/02/2018): With parallelizable attention networks, the neural Transformer is very f...
- EL-Attention: Memory Efficient Lossless Attention for Generation (05/11/2021): Transformer model with multi-head attention requires caching intermediat...
- Sparse is Enough in Scaling Transformers (11/24/2021): Large Transformer models yield impressive results on many tasks, but are...
- TranSFormer: Slow-Fast Transformer for Machine Translation (05/26/2023): Learning multiscale Transformer models has been evidenced as a viable ap...
- Fast-Slow Transformer for Visually Grounding Speech (09/16/2021): We present Fast-Slow Transformer for Visually Grounding Speech, or FaST-...
- G-Transformer for Document-level Machine Translation (05/31/2021): Document-level MT models are still far from satisfactory. Existing work ...
- Fast Transformer Decoding: One Write-Head is All You Need (11/06/2019): Multi-head attention layers, as used in the Transformer neural sequence ...