EAG: Extract and Generate Multi-way Aligned Corpus for Complete Multi-lingual Neural Machine Translation

03/04/2022
by Yulin Xu, et al.

Complete Multi-lingual Neural Machine Translation (C-MNMT) outperforms conventional MNMT by constructing a multi-way aligned corpus, i.e., by aligning bilingual training examples from different language pairs when either their source or target sides are identical. However, because exactly identical sentences across different language pairs are scarce, the power of the multi-way aligned corpus is limited by its scale. To address this problem, this paper proposes "Extract and Generate" (EAG), a two-step approach for constructing a large-scale, high-quality multi-way aligned corpus from bilingual data. Specifically, we first extract candidate aligned examples by pairing bilingual examples from different language pairs whose source or target sentences are highly similar, and then generate the final aligned examples from the candidates with a well-trained generation model. With this two-step pipeline, EAG can construct a large-scale multi-way aligned corpus whose diversity is almost identical to that of the original bilingual corpus. Experiments on two publicly available datasets, WMT-5 and OPUS-100, show that the proposed method achieves significant improvements over strong baselines, with gains of +1.1 and +1.4 BLEU points on the two datasets, respectively.
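
To make the extraction step concrete, here is a minimal sketch of pairing bilingual examples from two language pairs whose shared-language sides are highly similar. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not say which similarity measure or threshold is used, so the token-level `difflib.SequenceMatcher` ratio, the `threshold` value of 0.8, and the function names `similarity` and `extract_candidates` are all hypothetical stand-ins. The generation step, which would turn each candidate into a final aligned example with a trained model, is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Token-level similarity in [0, 1]; an assumed stand-in metric."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def extract_candidates(pairs_xy, pairs_xz, threshold=0.8):
    """Pair examples from two bilingual corpora that share language X.

    pairs_xy: list of (x, y) sentence pairs for language pair X-Y.
    pairs_xz: list of (x, z) sentence pairs for language pair X-Z.
    Returns tuples (x_from_xy, y, x_from_xz, z, score) for every pair of
    examples whose X sides are at least `threshold` similar. Exact matches
    score 1.0, so identical-sentence alignment is included as a special case.
    """
    candidates = []
    # Brute-force O(n * m) comparison, which is only practical for toy data.
    for x1, y in pairs_xy:
        for x2, z in pairs_xz:
            score = similarity(x1, x2)
            if score >= threshold:
                candidates.append((x1, y, x2, z, score))
    return candidates


# Toy English-German and English-French corpora (invented for illustration).
en_de = [
    ("the cat sat on the mat", "die Katze saß auf der Matte"),
    ("good morning", "guten Morgen"),
]
en_fr = [
    ("the cat sat on a mat", "le chat s'est assis sur un tapis"),
    ("see you tomorrow", "à demain"),
]

for x1, y, x2, z, score in extract_candidates(en_de, en_fr):
    print(f"{score:.2f} | {x1} | {y} | {x2} | {z}")
```

In this toy run the two English sentences about the cat differ by one token, so they clear the assumed threshold and yield a candidate German-English-French triple, while the unrelated sentences do not. A corpus-scale version would need an indexed or approximate search in place of the nested loop.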


