Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation

05/16/2023
by Zhuoyuan Mao, et al.

This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often adopt the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the default. However, Xu et al. (2019) revealed that PreNorm carries the risk of overfitting the training data. Based on this, we hypothesize that PreNorm may overfit supervised directions and thus generalize poorly to ZST. Through experiments on the OPUS, IWSLT, and Europarl datasets covering 54 ZST directions, we demonstrate that the original Transformer setting of LayerNorm after residual connections (PostNorm) consistently outperforms PreNorm by up to 12.3 BLEU points. We then analyze the differences in off-target rates and structural variations between PreNorm and PostNorm to account for the performance disparities. This study highlights the need for careful consideration of the LayerNorm setting in ZST.
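The PreNorm/PostNorm distinction the abstract refers to concerns only where LayerNorm sits relative to the residual connection around each Transformer sub-layer. The sketch below is a minimal, illustrative PyTorch snippet, not the authors' implementation; the class name, dimensions, and feed-forward sub-layer are hypothetical and serve only to show the two placements.

```python
import torch
import torch.nn as nn

class SublayerConnection(nn.Module):
    """Wraps a Transformer sub-layer (attention or feed-forward) with a
    residual connection and LayerNorm, switchable between PreNorm and PostNorm."""

    def __init__(self, d_model: int, pre_norm: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pre_norm = pre_norm

    def forward(self, x: torch.Tensor, sublayer) -> torch.Tensor:
        if self.pre_norm:
            # PreNorm: normalize the sub-layer input; the residual path
            # carries the un-normalized activation forward.
            return x + sublayer(self.norm(x))
        # PostNorm (original Transformer): normalize after the residual addition.
        return self.norm(x + sublayer(x))

# Usage: wrap a hypothetical feed-forward sub-layer both ways.
d_model = 512
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
x = torch.randn(8, 16, d_model)  # (batch, sequence length, d_model)
pre_out = SublayerConnection(d_model, pre_norm=True)(x, ffn)
post_out = SublayerConnection(d_model, pre_norm=False)(x, ffn)
```

In PreNorm the un-normalized residual stream is carried through the whole network, whereas PostNorm normalizes after every residual addition; the paper's hypothesis is that this structural difference affects how well the model generalizes to unseen (zero-shot) translation directions.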


Related research

05/25/2018 · Zero-Shot Dual Machine Translation
Neural Machine Translation (NMT) systems rely on large amounts of parall...

12/30/2020 · Improving Zero-Shot Translation by Disentangling Positional Information
Multilingual neural machine translation has shown the capability of dire...

06/15/2021 · Language Tags Matter for Zero-Shot Neural Machine Translation
Multilingual Neural Machine Translation (MNMT) has aroused widespread in...

09/10/2021 · Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables
Zero-shot translation, directly translating between language pairs unsee...

08/11/2022 · Language Tokens: A Frustratingly Simple Approach Improves Zero-Shot Performance of Multilingual Translation
This paper proposes a simple yet effective method to improve direct (X-t...

08/28/2023 · An Empirical Study of Consistency Regularization for End-to-End Speech-to-Text Translation
Consistency regularization methods, such as R-Drop (Liang et al., 2021) ...
