Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages

05/09/2023
by Kaushal Kumar Maurya, et al.

We address the task of machine translation from an extremely low-resource language (LRL) to English using cross-lingual transfer from a closely related high-resource language (HRL). For many of these languages, no parallel corpora are available; even monolingual corpora are limited, and representations in pre-trained sequence-to-sequence models are absent. These factors limit the benefits of cross-lingual transfer from shared embedding spaces in multilingual models. However, many extremely low-resource languages have a high level of lexical similarity with a related HRL. We exploit this property by injecting character and character-span noise into the training data of the HRL prior to learning the vocabulary. The noise acts as a regularizer that makes the model more robust to lexical divergences between the HRL and the LRL and thereby better facilitates cross-lingual transfer. On closely related HRL-LRL pairs from multiple language families, our method significantly outperforms both the baseline MT model and previously proposed approaches for cross-lingual transfer between closely related languages. We also show that the proposed character-span noise injection performs better than unigram-character noise injection.
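To make the core idea concrete, the sketch below illustrates what unigram-character and character-span noise injection into HRL text could look like before vocabulary learning. This is not the authors' released code; the noise probability, the choice of operations (delete/duplicate/replace for single characters, deletion for spans), and the span lengths are illustrative assumptions.

```python
import random

def inject_char_noise(sentence, p=0.05, seed=None):
    """Unigram-character noise: with probability p, delete, duplicate,
    or replace a single non-space character. Parameters are illustrative
    assumptions, not the paper's exact settings."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        if ch != " " and rng.random() < p:
            op = rng.choice(["delete", "duplicate", "replace"])
            if op == "delete":
                continue
            elif op == "duplicate":
                out.extend([ch, ch])
            else:
                # replace with a random lowercase letter (assumed noise alphabet)
                out.append(rng.choice("abcdefghijklmnopqrstuvwxyz"))
        else:
            out.append(ch)
    return "".join(out)

def inject_span_noise(sentence, p=0.05, max_span=3, seed=None):
    """Character-span noise: with probability p, delete a short span of
    2..max_span characters starting at the current position."""
    rng = random.Random(seed)
    chars = list(sentence)
    out, i = [], 0
    while i < len(chars):
        if chars[i] != " " and rng.random() < p:
            i += rng.randint(2, max_span)  # skip (delete) a short span
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

if __name__ == "__main__":
    hrl_sentence = "cross lingual transfer from a related language"
    print(inject_char_noise(hrl_sentence, p=0.1, seed=0))
    print(inject_span_noise(hrl_sentence, p=0.1, seed=0))
```

Following the abstract, such noised HRL sentences would be produced before the vocabulary is learned, so that the resulting subword segmentation (e.g., BPE or SentencePiece) is tolerant of the small spelling divergences expected in the related LRL.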

