Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing

05/18/2021
by Ganesh Jawahar, et al.

We describe models focused on the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task, finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations, which we exploit to improve language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach in which we first finetune the language models on synthetic data and then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases even superior to) several standard methods (backtranslation and a method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves the best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.
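To make the idea of generating code-mixed text from bilingual distributed representations more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not spell out the procedure, so the substitution rule (swap an English word for its nearest Hindi neighbour in a shared embedding space, with some probability), the toy vectors, and the names `EN_VECS`, `HI_VECS`, `nearest_hindi`, and `synth_code_mix` are all assumptions made for illustration.

```python
import math
import random

# Toy shared embedding space. The vectors are hypothetical and hand-picked
# for illustration; a real setup would load aligned bilingual embeddings.
EN_VECS = {
    "water": (0.9, 0.1, 0.0),
    "house": (0.1, 0.9, 0.1),
    "good": (0.0, 0.2, 0.9),
}
HI_VECS = {
    "paani": (0.88, 0.12, 0.02),
    "ghar": (0.12, 0.85, 0.08),
    "accha": (0.05, 0.18, 0.92),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_hindi(word):
    """Return the romanized Hindi word closest to `word`, or None if unknown."""
    vec = EN_VECS.get(word)
    if vec is None:
        return None
    return max(HI_VECS, key=lambda h: cosine(vec, HI_VECS[h]))


def synth_code_mix(sentence, replace_prob=0.5, seed=0):
    """Replace known English tokens with their nearest Hindi neighbour.

    Assumed rule: each token that has an embedding is swapped with
    probability `replace_prob`; all other tokens are left unchanged.
    """
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        hi = nearest_hindi(tok.lower())
        if hi is not None and rng.random() < replace_prob:
            out.append(hi)
        else:
            out.append(tok)
    return " ".join(out)
```

Under these assumptions, sentences produced by `synth_code_mix` would serve as the synthetic targets for the first finetuning stage of the curriculum, with the gold code-mixed data used in the second stage.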


