Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts

06/28/2019
by Denis Emelin, et al.

The transformer is a state-of-the-art neural translation model that uses attention to iteratively refine lexical representations with information drawn from the surrounding context. Lexical features are fed into the first layer and propagated through a deep network of hidden layers. We argue that the need to represent and propagate lexical features in each layer limits the model's capacity for learning and representing other information relevant to the task. To alleviate this bottleneck, we introduce gated shortcut connections between the embedding layer and each subsequent layer within the encoder and decoder. This enables the model to access relevant lexical content dynamically, without expending limited resources on storing it within intermediate states. We show that the proposed modification yields consistent improvements over a baseline transformer on standard WMT translation tasks in 5 translation directions (0.9 BLEU on average) and reduces the amount of lexical information passed along the hidden layers. We furthermore evaluate different ways to integrate lexical connections into the transformer architecture and present ablation experiments exploring the effect of the proposed shortcuts on model behavior.
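To make the gated-shortcut idea concrete, below is a minimal PyTorch sketch of one way such a connection could be realized: a sigmoid gate computed from the layer's incoming hidden states and the original token embeddings, which mixes the two before the layer is applied. The module and parameter names (LexicalShortcutGate, hidden_proj, embed_proj, encode) are illustrative assumptions, not the paper's exact parameterization, which is given in the full text.

```python
import torch
import torch.nn as nn


class LexicalShortcutGate(nn.Module):
    """Illustrative sketch of a gated lexical shortcut.

    Mixes a layer's incoming hidden states with the original token
    embeddings through a learned sigmoid gate, so the layer can access
    lexical content directly instead of carrying it through every
    intermediate state.
    """

    def __init__(self, d_model: int) -> None:
        super().__init__()
        # Projections feeding the gate; names and parameterization are
        # assumptions for this sketch.
        self.hidden_proj = nn.Linear(d_model, d_model, bias=False)
        self.embed_proj = nn.Linear(d_model, d_model, bias=True)

    def forward(self, hidden: torch.Tensor, embed: torch.Tensor) -> torch.Tensor:
        # hidden, embed: (batch, seq_len, d_model)
        gate = torch.sigmoid(self.hidden_proj(hidden) + self.embed_proj(embed))
        # Convex combination of lexical and contextual information.
        return gate * embed + (1.0 - gate) * hidden


# Illustrative use inside an encoder stack: re-inject the embeddings
# before every layer rather than feeding them only into the first one.
def encode(embeddings, encoder_layers, shortcut_gates):
    x = embeddings
    for layer, shortcut in zip(encoder_layers, shortcut_gates):
        x = shortcut(x, embeddings)
        x = layer(x)
    return x
```

The gate is computed per position and per dimension, so the model can decide dynamically how much lexical content each layer retrieves, in the spirit of the shortcut connections described above.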
