My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

06/24/2023
by Tanmay Chavan, et al.

Research on code-mixed data is limited by the unavailability of dedicated code-mixed datasets and pre-trained language models. In this work, we focus on the low-resource Indian language Marathi, which lacks any prior work in code-mixing. We present L3Cube-MeCorpus, a large code-mixed Marathi-English (Mr-En) corpus with 5 million tweets for pretraining. We also release L3Cube-MeBERT and MeRoBERTa, code-mixed BERT-based transformer models pre-trained on MeCorpus. Furthermore, for benchmarking, we present three supervised datasets, MeHate, MeSent, and MeLID, for the downstream tasks of code-mixed Mr-En hate speech detection, sentiment analysis, and language identification, respectively. Each of these evaluation datasets consists of ~12,000 manually annotated Marathi-English code-mixed tweets. Ablations show that the models trained on this novel corpus significantly outperform the existing state-of-the-art BERT models. This is the first work that presents artifacts for code-mixed Marathi research. All datasets and models are publicly released at https://github.com/l3cube-pune/MarathiNLP .

research
04/18/2022

L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

Code-switching occurs when more than one language is mixed in a given se...
research
05/25/2023

Comparative Study of Pre-Trained BERT Models for Code-Mixed Hindi-English Data

The term "Code Mixed" refers to the use of more than one language in the...
research
06/08/2023

Leveraging Language Identification to Enhance Code-Mixed Text Classification

The usage of more than one language in the same text is referred to as C...
research
06/10/2022

Borrowing or Codeswitching? Annotating for Finer-Grained Distinctions in Language Mixing

We present a new corpus of Twitter data annotated for codeswitching and ...
research
03/31/2017

Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data

In this paper, we propose efficient and less resource-intensive strategi...
research
09/13/2022

Robin: A Novel Online Suicidal Text Corpus of Substantial Breadth and Scale

Suicide is a major public health crisis. With more than 20,000,000 suici...
research
11/13/2018

Hate Speech Detection from Code-mixed Hindi-English Tweets Using Deep Learning Models

This paper reports an increment to the state-of-the-art in hate speech d...
