Crosslingual Embeddings are Essential in UNMT for Distant Languages: An English to IndoAryan Case Study

06/09/2021
by   Tamali Banerjee, et al.

Recent advances in Unsupervised Neural Machine Translation (UNMT) have minimized the gap between supervised and unsupervised machine translation performance for closely related language pairs. The situation is very different for distant language pairs, however: the lack of lexical overlap and the low syntactic similarity between, for example, English and Indo-Aryan languages lead to poor translation quality in existing UNMT systems. In this paper, we show that initializing the embedding layer of UNMT models with cross-lingual embeddings yields significant BLEU score improvements over existing approaches in which embeddings are randomly initialized. Further, static embeddings (freezing the embedding layer weights) lead to larger gains than updating the embedding layer weights during training (non-static). We experimented with the Masked Sequence to Sequence (MASS) and Denoising Autoencoder (DAE) UNMT approaches on three distant language pairs. The proposed cross-lingual embedding initialization improves BLEU score by as much as ten times over the baseline for English-Hindi, English-Bengali, and English-Gujarati. Our analysis shows the importance of cross-lingual embeddings, compares the approaches, and identifies the scope for further improvement in these systems.
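The initialization idea described in the abstract can be sketched in a few lines. The snippet below is a minimal, framework-free illustration, not the paper's actual implementation (which uses MASS/DAE Transformer models); the names `vocab` and `pretrained` are hypothetical. Each row of the embedding matrix is filled from a pretrained cross-lingual vector when one exists, with a small random fallback otherwise; in the "static" setting this matrix would then simply be excluded from gradient updates.

```python
import random

def init_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Build an embedding matrix (one row per vocabulary word).

    Words found in the pretrained cross-lingual space receive their
    pretrained vector; the remaining words get a small random
    initialization, as is common practice.
    """
    rng = random.Random(seed)
    matrix = []
    for word in vocab:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix

# Toy shared English-Hindi cross-lingual space (2-dimensional for brevity):
# translation pairs sit close together in the same vector space.
pretrained = {"dog": [0.9, 0.1], "कुत्ता": [0.88, 0.12]}
vocab = ["dog", "कुत्ता", "<unk>"]
emb = init_embedding_matrix(vocab, pretrained, dim=2)
# "Static" setting: freeze this matrix (no gradient updates during UNMT
# training). "Non-static" setting: fine-tune it along with the model.
```

The key design choice the paper studies is the freezing step: because the pretrained vectors already place translation equivalents near each other across languages, keeping them fixed preserves that cross-lingual signal throughout training.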


