Parallel Attention Forcing for Machine Translation

11/06/2022
by Qingyun Dou, et al.

Attention-based autoregressive models have achieved state-of-the-art performance in various sequence-to-sequence tasks, including Text-To-Speech (TTS) and Neural Machine Translation (NMT), but they can be difficult to train. The standard training approach, teacher forcing, guides a model with the reference back-history, whereas during inference the generated back-history must be used. This mismatch between training and inference limits evaluation performance. Attention forcing has been introduced to address the mismatch: it guides the model with the generated back-history and the reference attention. While successful in tasks with continuous outputs such as TTS, attention forcing faces additional challenges in tasks with discrete outputs such as NMT. This paper introduces two extensions of attention forcing to tackle these challenges. (1) Scheduled attention forcing automatically turns attention forcing on and off, which is essential for tasks with discrete outputs. (2) Parallel attention forcing makes training parallel and is applicable to Transformer-based models. Experiments show that the proposed approaches improve the performance of models based on RNNs and Transformers.
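As a rough illustration only (not the authors' implementation), the sketch below contrasts teacher forcing with attention forcing in a toy PyTorch attention decoder. All names here (ToyAttnDecoder, train_step, ref_attn, the attention_forcing flag) are hypothetical; the assumption is simply that a decoder step returns token logits and attention weights given the previous token and the encoder outputs.

```python
# Minimal sketch (assumed, not the paper's code): teacher forcing vs attention forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAttnDecoder(nn.Module):
    """Toy single-step GRU decoder with additive attention (illustrative only)."""
    def __init__(self, vocab, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRUCell(2 * dim, dim)
        self.attn_score = nn.Linear(2 * dim, 1)
        self.out = nn.Linear(dim, vocab)

    def step(self, prev_token, state, enc_out, attn_override=None):
        emb = self.embed(prev_token)                       # (B, D)
        q = state.unsqueeze(1).expand_as(enc_out)          # (B, T, D)
        alpha = F.softmax(
            self.attn_score(torch.cat([q, enc_out], -1)).squeeze(-1), dim=-1)
        # Attention forcing: replace the generated attention with a reference.
        attn = attn_override if attn_override is not None else alpha
        ctx = torch.bmm(attn.unsqueeze(1), enc_out).squeeze(1)
        state = self.rnn(torch.cat([emb, ctx], -1), state)
        return self.out(state), alpha, state

def train_step(dec, enc_out, ref_tokens, ref_attn=None, attention_forcing=False):
    """One training pass over a target sequence.

    Teacher forcing:   feed reference tokens back in.
    Attention forcing: feed *generated* tokens back in, guide the model with
                       reference attention, and align generated attention to it.
    """
    B, L = ref_tokens.shape
    state = enc_out.new_zeros(B, dec.rnn.hidden_size)
    prev = ref_tokens.new_zeros(B)                         # assume <bos> id is 0
    ce, attn_kl = 0.0, 0.0
    for t in range(L):
        override = ref_attn[:, t] if attention_forcing else None
        logits, alpha, state = dec.step(prev, state, enc_out, override)
        ce = ce + F.cross_entropy(logits, ref_tokens[:, t])
        if attention_forcing:
            attn_kl = attn_kl + F.kl_div(alpha.clamp_min(1e-8).log(),
                                         ref_attn[:, t], reduction="batchmean")
            prev = logits.argmax(-1).detach()              # generated back-history
        else:
            prev = ref_tokens[:, t]                        # reference back-history
    return ce / L + attn_kl / L
```

In these terms, scheduled attention forcing would correspond to switching the attention_forcing flag on or off automatically per sequence, and parallel attention forcing would replace the sequential loop with a parallel, Transformer-style pass; this toy RNN sketch does not attempt either.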


