Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

01/21/2023
by Sin-En Lu, et al.

In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method for constructing a Hokkien-Mandarin CM dataset that mitigates this limitation, addresses the morphological issues of the Sino-Tibetan language family, and offers an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use the proposed dataset and transfer learning to train XLM (a cross-lingual language model) for translation tasks, adapting XLM slightly to fit the code-mixing scenario. We find that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.
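The abstract does not spell out how code-mixed sentences are built or how language tags are attached, so the following is only a toy illustration of the general idea, not the authors' pipeline. It assumes a pre-segmented Mandarin sentence, a hypothetical two-entry Mandarin-to-Hokkien lexicon (`TOY_LEXICON`), and a made-up tag format using the codes `zh` and `nan`; the function names are likewise invented for this sketch.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: Mandarin word -> Taiwanese Hokkien word (Han characters).
# A real resource would be far larger; these entries are for illustration only.
TOY_LEXICON: Dict[str, str] = {"今天": "今仔日", "吃飯": "食飯"}


def code_mix(
    tokens: List[str], lexicon: Dict[str, str], switch: Set[int]
) -> List[Tuple[str, str]]:
    """Swap the tokens at the positions in `switch` for their lexicon entry.

    Every output token is paired with a language tag: "nan" for a
    substituted Hokkien word, "zh" for an untouched Mandarin word.
    Positions whose token is missing from the lexicon stay Mandarin.
    """
    mixed: List[Tuple[str, str]] = []
    for i, tok in enumerate(tokens):
        if i in switch and tok in lexicon:
            mixed.append((lexicon[tok], "nan"))
        else:
            mixed.append((tok, "zh"))
    return mixed


def to_tagged_string(mixed: List[Tuple[str, str]]) -> str:
    """Serialize (token, language) pairs as '<lang> token' units (assumed format)."""
    return " ".join(f"<{lang}> {tok}" for tok, lang in mixed)


if __name__ == "__main__":
    sentence = ["我", "今天", "要", "吃飯"]  # already segmented Mandarin
    print(to_tagged_string(code_mix(sentence, TOY_LEXICON, {3})))
```

Per-token tags of this kind are one simple way to tell a model which language each word belongs to; how the paper's adapted XLM actually consumes language information is described in the full text, not here.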


