Learning Hard Retrieval Cross Attention for Transformer

09/30/2020
by Hongfei Xu, et al.

The Transformer translation model, which is based on the multi-head attention mechanism, can be parallelized easily and leads to competitive performance in machine translation. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by jointly attending to information from different representation subspaces at different positions. Despite its advantages in parallelization, many previous works suggest that the computation of the attention mechanism is not sufficiently efficient, especially when processing long sequences, and propose approaches to improve its efficiency on long sentences. In this paper, we accelerate the inference of the scaled dot-product attention from another perspective. Specifically, instead of squeezing the sequence to attend to, we simplify the computation of the scaled dot-product attention by learning a hard retrieval attention which attends to only one token in the sentence rather than to all tokens. Since the hard attention mechanism attends to only one position, the matrix multiplication between the attention probabilities and the value sequence in the standard scaled dot-product attention can be replaced by a simple and efficient retrieval operation. As a result, our hard retrieval attention mechanism can empirically accelerate the scaled dot-product attention for both long and short sequences by 66.5% on a wide range of machine translation tasks when used for cross attention networks.
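
To make the retrieval step concrete, here is a minimal PyTorch sketch (illustrative only, not the authors' implementation; the function names and tensor shapes are assumptions) contrasting standard scaled dot-product attention with a hard retrieval variant in which each query attends to a single key position, so that the probability-value matrix multiplication is replaced by an argmax followed by a gather. How the hard selection is learned during training is described in the paper and is not shown here.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # Standard attention: softmax over all key positions, then a weighted
    # sum of the value rows via a matrix multiplication.
    # q: (batch, heads, q_len, head_dim); k, v: (batch, heads, kv_len, head_dim)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)                      # (batch, heads, q_len, head_dim)

def hard_retrieval_attention(q, k, v):
    # Hard variant: each query attends to exactly one key position, so the
    # probability-value product reduces to a cheap index/retrieval operation.
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
    idx = scores.argmax(dim=-1)                        # (batch, heads, q_len)
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, v.size(-1))
    return torch.gather(v, dim=2, index=idx)           # retrieve one value row per query

# Tiny cross-attention shape check: 5 decoder queries over 7 encoder states.
q = torch.randn(2, 4, 5, 16)
k = torch.randn(2, 4, 7, 16)
v = torch.randn(2, 4, 7, 16)
print(hard_retrieval_attention(q, k, v).shape)         # torch.Size([2, 4, 5, 16])
```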
