From Machine Translation to Code-Switching: Generating High-Quality Code-Switched Text

07/14/2021
by Ishan Tarunesh, et al.

Generating code-switched (CS) text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text. In this work, we adapt a state-of-the-art neural machine translation model to generate Hindi-English code-switched sentences starting from monolingual Hindi sentences. We outline a carefully designed curriculum of pretraining steps, including the use of synthetic code-switched text, that enables the model to generate high-quality code-switched text. Using text generated by our model as data augmentation, we show significant reductions in perplexity on a language modeling task compared to using text from other generative models of CS text. We also show improvements when using our text for a downstream code-switched natural language inference task. Our generated text is further subjected to a rigorous evaluation, comprising a human evaluation study and a range of objective metrics, in which it performs comparably (and sometimes even better) than code-switched text obtained from crowd workers who are native Hindi speakers.
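Since the abstract reports its language modeling gains as reductions in perplexity, a minimal sketch of how that metric is computed may help. It uses the standard definition (the exponential of the average negative log-likelihood per token). The function name `perplexity` and the probability values are illustrative assumptions, not code or results from the paper.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood.
    Lower values mean the model finds the text less surprising.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Hypothetical probabilities that two language models assign to the same
# three held-out tokens; these numbers are made up for illustration only.
baseline_lm = [math.log(0.05), math.log(0.10), math.log(0.02)]
augmented_lm = [math.log(0.10), math.log(0.20), math.log(0.04)]

print(f"baseline perplexity:  {perplexity(baseline_lm):.2f}")
print(f"augmented perplexity: {perplexity(augmented_lm):.2f}")
```

In this toy comparison the second model assigns every token twice the probability, so its perplexity is half that of the first. A reduction of this kind on held-out code-switched text is what the abstract refers to.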


research
09/10/2023

The Effect of Alignment Objectives on Code-Switching Translation

One of the things that need to change when it comes to machine translati...
research
11/06/2018

Code-switching Sentence Generation by Generative Adversarial Networks and its Application to Data Augmentation

Code-switching is about dealing with alternative languages in speech or ...
research
10/20/2022

The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT)

The University of Edinburgh participated in the WMT22 shared task on cod...
research
10/24/2018

Learn to Code-Switch: Data Augmentation using Copy Mechanism on Language Modeling

Building large-scale datasets for training code-switching language model...
research
05/18/2021

Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing

We describe models focused at the understudied problem of translating be...
research
09/06/2018

Code-switched Language Models Using Dual RNNs and Same-Source Pretraining

This work focuses on building language models (LMs) for code-switched te...
research
06/21/2019

A Deep Generative Model for Code-Switched Text

Code-switching, the interleaving of two or more languages within a sente...
