Denoising-based UNMT is more robust to word-order divergence than MASS-based UNMT

03/02/2023
by Tamali Banerjee, et al.

We aim to investigate whether UNMT approaches with self-supervised pre-training are robust to word-order divergence between language pairs. We do this by comparing two models pre-trained with the same self-supervised objective: the first is trained on language pairs with different word orders, and the second on the same language pairs with the source language reordered to match the word order of the target language. Ideally, a UNMT approach that is robust to word-order divergence should exhibit no visible performance difference between the two configurations. In this paper, we investigate two such self-supervised pre-training based UNMT approaches, namely Masked Sequence-to-Sequence Pre-training (MASS), which has no shuffling noise, and Denoising AutoEncoder (DAE), which has shuffling noise. We experiment with five English→Indic language pairs (en-hi, en-bn, en-gu, en-kn, and en-ta), where the word order of the source language is SVO (Subject-Verb-Object) and that of the target languages is SOV (Subject-Object-Verb). We observe that, for these language pairs, the DAE-based UNMT approach consistently outperforms MASS in terms of translation accuracy. Moreover, bridging the word-order gap through reordering improves the translation accuracy of MASS-based UNMT models, but does not improve that of DAE-based UNMT models. This indicates that DAE-based UNMT is more robust to word-order divergence than MASS-based UNMT. The word-shuffling noise in the DAE approach is a possible reason for its robustness to word-order divergence.
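To make the contrast between the two pre-training objectives concrete, below is a minimal, hypothetical Python sketch of the two corruption schemes the abstract refers to: DAE-style noising with local word shuffling (plus word dropout), and MASS-style masking of a contiguous span with no shuffling. The function names and parameters (k, p_drop, frac) are illustrative assumptions, not the exact settings used in the paper.

```python
import random

def dae_noise(tokens, k=3, p_drop=0.1):
    """DAE-style corruption (in the spirit of Lample et al., 2018):
    locally shuffle words so that each word moves by at most about k
    positions, then drop a small fraction of words. The autoencoder is
    trained to reconstruct the original sentence from this noisy input."""
    # Local shuffle: sort tokens by (original index + uniform noise in [0, k)).
    keys = [i + random.uniform(0, k) for i in range(len(tokens))]
    shuffled = [tok for _, tok in sorted(zip(keys, tokens), key=lambda pair: pair[0])]
    # Word dropout: randomly remove tokens with probability p_drop.
    kept = [tok for tok in shuffled if random.random() > p_drop]
    return kept if kept else shuffled  # never return an empty sentence

def mass_mask(tokens, mask_token="<mask>", frac=0.5):
    """MASS-style corruption: replace one contiguous span (roughly half the
    sentence) with mask tokens; the decoder is trained to generate only that
    span. Note that no word-shuffling noise is applied."""
    span_len = max(1, int(len(tokens) * frac))
    start = random.randrange(len(tokens) - span_len + 1)
    masked_input = (tokens[:start]
                    + [mask_token] * span_len
                    + tokens[start + span_len:])
    target_span = tokens[start:start + span_len]
    return masked_input, target_span

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    print("DAE input :", dae_noise(sent))
    print("MASS input:", mass_mask(sent)[0])
```

Because DAE pre-training already exposes the encoder to locally shuffled inputs, the model learns to undo word-order perturbations during reconstruction, which is one plausible explanation for the robustness to word-order divergence reported in the abstract.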
