Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

03/23/2023
by Zheng Xin Yong, et al.

While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The recent proliferation of Large Language Models (LLMs) compels one to ask: can these systems be used for data generation? In this article, we explore prompting multilingual LLMs in a zero-shot manner to create code-mixed data for five languages in South East Asia (SEA): Indonesian, Malay, Chinese, Tagalog, and Vietnamese, as well as the creole language Singlish. We find that ChatGPT shows the most potential, capable of producing code-mixed text 68% of the time when the term "code-mixing" is explicitly defined. Moreover, both ChatGPT's and InstructGPT's (davinci-003) performances in generating Singlish texts are noteworthy, averaging a 96% success rate across a variety of prompts. The code-mixing proficiency of ChatGPT and InstructGPT, however, is dampened by word choice errors that lead to semantic inaccuracies. Other multilingual models such as BLOOMZ and Flan-T5-XXL are unable to produce code-mixed texts altogether. By highlighting the limited promises of LLMs in a specific form of low-resource data generation, we call for a measured approach when applying similar techniques to other data-scarce NLP contexts.


