Linguistically-driven Multi-task Pre-training for Low-resource Neural Machine Translation

01/20/2022
by Zhuoyuan Mao, et al.

In the present study, we propose novel sequence-to-sequence pre-training objectives for low-resource neural machine translation (NMT): Japanese-specific sequence to sequence (JASS) for language pairs involving Japanese as the source or target language, and English-specific sequence to sequence (ENSS) for language pairs involving English. JASS focuses on masking and reordering Japanese linguistic units known as bunsetsu, whereas ENSS is based on phrase structure masking and reordering tasks. Experiments on the ASPEC Japanese–English and Japanese–Chinese, Wikipedia Japanese–Chinese, and News English–Korean corpora demonstrate that JASS and ENSS outperform MASS and other existing language-agnostic pre-training methods by up to +2.9 BLEU points for the Japanese–English tasks, up to +7.0 BLEU points for the Japanese–Chinese tasks, and up to +1.3 BLEU points for the English–Korean tasks. Empirical analysis of the relationship between the individual subtasks of JASS and ENSS reveals that they are complementary. Adequacy evaluation using LASER, human evaluation, and case studies reveal that our proposed methods significantly outperform pre-training methods without injected linguistic knowledge, and that they improve adequacy more than fluency. The code is available at: https://github.com/Mao-KU/JASS/tree/master/linguistically-driven-pretraining.
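To make the idea of unit-level masking and reordering concrete, the following is a minimal sketch and not the authors' implementation. It assumes the sentence has already been segmented into bunsetsu-like units (each a list of tokens) by some external tool. The names `MASK`, `flatten`, `mask_units`, and `shuffle_units` are hypothetical, the masking ratio is arbitrary, and the random shuffle is only a stand-in for the linguistically motivated reordering the abstract refers to. Which side serves as input and which as target is likewise an assumption.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def flatten(units):
    """Concatenate a list of units (each a list of tokens) into one token list."""
    return [tok for unit in units for tok in unit]


def mask_units(units, ratio=0.5, seed=0):
    """Mask a contiguous run of whole units (assumes `units` is non-empty).

    Returns (source, target): the source has every token of the chosen units
    replaced by MASK, and the target holds the tokens that were hidden.
    """
    rng = random.Random(seed)
    span = min(len(units), max(1, round(len(units) * ratio)))
    start = rng.randint(0, len(units) - span)
    source = []
    for i, unit in enumerate(units):
        if start <= i < start + span:
            source.extend([MASK] * len(unit))  # hide the whole unit
        else:
            source.extend(unit)
    target = flatten(units[start:start + span])
    return source, target


def shuffle_units(units, seed=0):
    """Permute whole units at random; tokens inside a unit keep their order.

    Returns (source, target): source is the permuted sentence, target is the
    original one. Random permutation is an assumption made for illustration.
    """
    rng = random.Random(seed)
    order = list(range(len(units)))
    rng.shuffle(order)
    source = flatten([units[i] for i in order])
    return source, flatten(units)


if __name__ == "__main__":
    # Hand-segmented toy example; a real pipeline would use a parser.
    example = [["私", "は"], ["本", "を"], ["読む"]]
    print(mask_units(example))
    print(shuffle_units(example))
```

The point of operating on units rather than on individual tokens is that unit boundaries are never split: a masked span always covers complete units, and reordering moves each unit as a block.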


05/07/2020

JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation

Neural machine translation (NMT) needs large parallel corpora for state-...
01/23/2020

Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation

Sequence-to-sequence (S2S) pre-training using large monolingual data is ...
01/22/2020

Multilingual Denoising Pre-training for Neural Machine Translation

This paper demonstrates that multilingual denoising pre-training produce...
05/07/2019

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Pre-training and fine-tuning, e.g., BERT, have achieved great success in...
03/12/2021

Bilingual Dictionary-based Language Model Pretraining for Neural Machine Translation

Recent studies have demonstrated a perceivable improvement on the perfor...
09/07/2022

On the Complementarity between Pre-Training and Random-Initialization for Resource-Rich Machine Translation

Pre-Training (PT) of text representations has been successfully applied ...
11/07/2019

Microsoft Research Asia's Systems for WMT19

We Microsoft Research Asia made submissions to 11 language directions in...

Code Repositories

JASS

JASS: Japanese-specific Sequence to Sequence Pre-training for Neural Machine Translation (LREC2020)

