Faster Transformer Decoding: N-gram Masked Self-Attention

01/14/2020
by Ciprian Chelba, et al.

Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence S=s_1, ..., s_S, we propose truncating the target-side window used for computing self-attention by making an N-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the N-gram masked self-attention model loses very little in BLEU score for N values in the range 4, ..., 8, depending on the task.
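To make the N-gram assumption concrete, below is a minimal sketch of one plausible reading of it: a banded causal mask in which each target position may attend only to itself and the preceding N-1 target tokens. This is an illustrative NumPy sketch, not the paper's implementation; the function name, window convention, and shapes are assumptions.

import numpy as np

def ngram_causal_mask(seq_len, n):
    # Boolean mask: position i may attend to positions j with
    # i - n < j <= i, i.e. itself and the n - 1 preceding tokens.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - n)

# Illustration: with n = 4, position 6 attends to positions 3..6 only.
print(ngram_causal_mask(seq_len=8, n=4).astype(int))

As with the standard causal mask, disallowed positions would have their attention scores set to -inf before the softmax in the decoder's self-attention; source-side (encoder-decoder) attention is left untouched.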

