MUTANT: A Multi-sentential Code-mixed Hinglish Dataset

02/23/2023
by Rahul Gupta, et al.

Multi-sentential, long-sequence textual data opens up several interesting research directions in natural language processing and generation. Although several high-quality long-sequence datasets exist for English and other languages in monolingual settings, there has been no significant effort to build such resources for code-mixed languages such as Hinglish (code-mixing of Hindi and English). In this paper, we propose a novel task: identifying multi-sentential code-mixed text (MCT) in multilingual articles. As a use case, we leverage multilingual articles from two different data sources and build MUTANT, a first-of-its-kind multi-sentential code-mixed Hinglish dataset. We propose a token-level language-aware pipeline, extend the existing metrics that measure the degree of code-mixing to a multi-sentential framework, and use them to automatically identify MCT in the multilingual articles. The MUTANT dataset comprises 67k articles with 85k identified Hinglish MCTs. To facilitate future research, we make the dataset publicly available.
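The abstract does not say which code-mixing metrics are extended or how. As a rough illustration only, the sketch below assumes the widely used Code-Mixing Index (CMI), computed from token-level language tags, and a hypothetical multi-sentential aggregate (the mean of per-sentence CMI). The tag names, the averaging rule, the `threshold` value, and the `is_mct` helper are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def cmi(tags, neutral="other"):
    """Code-Mixing Index for one sequence of token-level language tags.

    CMI = 100 * (1 - max_lang_count / (n - u)), where n is the number of
    tokens and u is the number of language-independent tokens. Returns 0.0
    when every token is language-independent.
    """
    counts = Counter(t for t in tags if t != neutral)
    lang_tokens = sum(counts.values())  # this is n - u
    if lang_tokens == 0:
        return 0.0
    return 100.0 * (1.0 - max(counts.values()) / lang_tokens)


def multi_sentence_cmi(sentences, neutral="other"):
    """Hypothetical multi-sentential score: mean of per-sentence CMI.

    This averaging rule is an assumption for illustration, not the
    extension proposed in the paper.
    """
    if not sentences:
        return 0.0
    return sum(cmi(s, neutral) for s in sentences) / len(sentences)


def is_mct(sentences, threshold=20.0, min_sentences=2):
    """Flag a span as multi-sentential code-mixed text (assumed criteria)."""
    return (
        len(sentences) >= min_sentences
        and multi_sentence_cmi(sentences) >= threshold
    )


# Example: two sentences tagged per token as Hindi ("hi"), English ("en"),
# or language-independent ("other").
span = [["hi", "en", "hi", "en"], ["en", "en", "other"]]
score = multi_sentence_cmi(span)
flagged = is_mct(span)
```

Averaging per-sentence scores keeps a single heavily mixed sentence from dominating a long span; a real pipeline might instead weight sentences by length or require a minimum number of mixed sentences.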


research
06/18/2021

Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text

Code-mixing is a frequent communication style among multilingual speaker...
research
07/08/2021

HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text

Text generation is a highly active area of research in the computational...
research
06/15/2018

A Dataset for Building Code-Mixed Goal Oriented Conversation Systems

There is an increasing demand for goal-oriented conversation systems whi...
research
01/21/2023

Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

In natural language processing (NLP), code-mixing (CM) is a challenging ...
research
04/10/2020

A New Dataset for Natural Language Inference from Code-mixed Conversations

Natural Language Inference (NLI) is the task of inferring the logical re...
research
06/15/2021

Challenges and Considerations with Code-Mixed NLP for Multilingual Societies

Multilingualism refers to the high degree of proficiency in two or more ...
research
11/13/2019

Prevalence of code mixing in semi-formal patient communication in low resource languages of South Africa

In this paper we address the problem of code-mixing in resource-poor lan...
