Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation

01/20/2022
by Zhuoyuan Mao et al.

In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase-structure masking and reordering tasks. Experiments on the ASPEC Japanese–English and Japanese–Chinese, Wikipedia Japanese–Chinese, and News English–Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese–English tasks, up to +7.0 BLEU points for the Japanese–Chinese tasks, and up to +1.3 BLEU points for the English–Korean tasks. An empirical analysis focusing on the relationship between the individual subtasks of JASS and ENSS reveals their complementary nature. Adequacy evaluation using LASER, human evaluation, and case studies reveal that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge, and that they improve adequacy more than fluency. Our code is available here: https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.
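To make the pre-training objectives concrete, the following is a minimal sketch (not the authors' released code) of how linguistic-unit masking and reordering could be turned into sequence-to-sequence training pairs. The toy bunsetsu segmentation, the mask token, and the masking ratio below are illustrative assumptions; the actual JASS/ENSS settings are defined in the paper and the repository linked above.

```python
# Minimal sketch of bunsetsu-level masking and reordering as
# sequence-to-sequence pre-training examples. `bunsetsu_chunks` is assumed to
# already be a list of bunsetsu strings (e.g. from an external Japanese
# dependency parser); mask token and ratio are illustrative assumptions.
import random

MASK = "<mask>"

def make_mask_example(bunsetsu_chunks, mask_ratio=0.5):
    """Mask a contiguous span of bunsetsu; the decoder must reconstruct the span."""
    n = len(bunsetsu_chunks)
    span_len = max(1, int(n * mask_ratio))
    start = random.randrange(0, n - span_len + 1)
    masked_span = bunsetsu_chunks[start:start + span_len]
    source = (bunsetsu_chunks[:start]
              + [MASK] * span_len
              + bunsetsu_chunks[start + span_len:])
    return " ".join(source), " ".join(masked_span)

def make_reorder_example(bunsetsu_chunks):
    """Shuffle bunsetsu order; the decoder must restore the original sentence."""
    shuffled = bunsetsu_chunks[:]
    random.shuffle(shuffled)
    return " ".join(shuffled), " ".join(bunsetsu_chunks)

if __name__ == "__main__":
    # Toy bunsetsu segmentation; spaces are only for readability.
    chunks = ["私は", "昨日", "図書館で", "本を", "借りた"]
    print(make_mask_example(chunks))
    print(make_reorder_example(chunks))
```

Analogous ENSS pairs would mask or reorder syntactic phrases obtained from an English constituency parse instead of bunsetsu.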

Related research

05/07/2020
JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation
Neural machine translation (NMT) needs large parallel corpora for state-...

01/23/2020
Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation
Sequence-to-sequence (S2S) pre-training using large monolingual data is ...

10/11/2020
SJTU-NICT's Supervised and Unsupervised Neural Machine Translation Systems for the WMT20 News Translation Task
In this paper, we introduced our joint team SJTU-NICT's participation i...

01/22/2020
Multilingual Denoising Pre-training for Neural Machine Translation
This paper demonstrates that multilingual denoising pre-training produce...

05/09/2021
Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation
The data scarcity in low-resource languages has become a bottleneck to b...

09/07/2022
On the Complementarity between Pre-Training and Random-Initialization for Resource-Rich Machine Translation
Pre-Training (PT) of text representations has been successfully applied ...

06/29/2019
Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings
Since Bahdanau et al. [1] first introduced attention for neural machine ...
