Language Agnostic Code-Mixing Data Augmentation by Predicting Linguistic Patterns

11/14/2022
by Shuyue Stella Li, et al.

In this work, we focus on intrasentential code-mixing and propose several Synthetic Code-Mixing (SCM) data augmentation methods that outperform the baseline on downstream sentiment analysis tasks across various amounts of labeled gold data. Most importantly, our proposed methods demonstrate that strategically replacing parts of sentences in the matrix language with a constant mask significantly improves classification accuracy, motivating further linguistic insights into the phenomenon of code-mixing. We test our data augmentation method in a variety of low-resource and cross-lingual settings, reaching a relative improvement of up to 7.73% on the scarce English-Malayalam dataset. We conclude that the code-switch pattern in code-mixed sentences is also important for the model to learn. Finally, we propose a language-agnostic SCM algorithm that is cheap yet extremely helpful for low-resource languages.
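The mask-based idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract says the replaced spans are chosen strategically but does not say how, so the sketch below simply picks token positions uniformly at random. The function name `mask_code_mix`, the `<mask>` token string, and the `ratio` parameter are all hypothetical names introduced for illustration.

```python
import random

# Hypothetical constant mask token; the abstract does not name the actual token.
MASK = "<mask>"


def mask_code_mix(sentence, ratio=0.3, mask_token=MASK, seed=None):
    """Replace a fraction of whitespace-separated tokens with a constant mask.

    Assumption: positions are sampled uniformly at random. The paper selects
    positions strategically, which this sketch does not attempt to reproduce.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    if not tokens:
        return sentence
    # Mask at least one token, and never more than the sentence contains.
    k = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = set(rng.sample(range(len(tokens)), k))
    return " ".join(
        mask_token if i in positions else tok for i, tok in enumerate(tokens)
    )


if __name__ == "__main__":
    # The label of the original sentence would be reused for the augmented copy.
    print(mask_code_mix("the movie was really good", ratio=0.4, seed=0))
```

Because the replacement is a constant token rather than a translated word, a sketch like this needs no dictionary, translation model, or language identifier, which is what would make such an approach language-agnostic and cheap.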

research
09/06/2019

A systematic comparison of methods for low-resource dependency parsing on genuinely low-resource languages

Parsers are available for only a handful of the world's languages, since...
research
06/16/2022

PreCogIIITH at HinglishEval : Leveraging Code-Mixing Metrics Language Model Embeddings To Estimate Code-Mix Quality

Code-Mixing is a phenomenon of mixing two or more languages in a speech ...
research
04/10/2023

Transfer Learning for Low-Resource Sentiment Analysis

Sentiment analysis is the process of identifying and extracting subjecti...
research
09/08/2020

kk2018 at SemEval-2020 Task 9: Adversarial Training for Code-Mixing Sentiment Classification

Code switching is a linguistic phenomenon that may occur within a multil...
research
10/13/2022

CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing

A bottleneck to developing Semantic Parsing (SP) models is the need for ...
research
12/01/2020

Automatically Identifying Language Family from Acoustic Examples in Low Resource Scenarios

Existing multilingual speech NLP works focus on a relatively small subse...
research
11/13/2019

Prevalence of code mixing in semi-formal patient communication in low resource languages of South Africa

In this paper we address the problem of code-mixing in resource-poor lan...
