L3Cube-HingCorpus and HingBERT: A Code Mixed Hindi-English Dataset and BERT Language Models

04/18/2022
by Ravindra Nayak, et al.

Code-switching occurs when more than one language is mixed within a single sentence or conversation. The phenomenon is especially prominent on social media platforms, and its adoption is increasing over time; code-mixed NLP has therefore been studied extensively in the literature. As pre-trained transformer-based architectures gain popularity, we observe that real code-mixed data is too scarce to pre-train large language models. We present L3Cube-HingCorpus, the first large-scale real Hindi-English code-mixed dataset in Roman script. It consists of 52.93M sentences and 1.04B tokens scraped from Twitter. We further present HingBERT, HingMBERT, HingRoBERTa, and HingGPT. The BERT models have been pre-trained on the code-mixed HingCorpus using masked language modelling objectives. We show the effectiveness of these BERT models on downstream tasks from the GLUECoS benchmark: code-mixed sentiment analysis, POS tagging, NER, and LID. HingGPT is a GPT2-based generative transformer model capable of generating full tweets. We also release L3Cube-HingLID Corpus, the largest code-mixed Hindi-English language identification (LID) dataset, and HingBERT-LID, a production-quality LID model that facilitates the capture of more code-mixed data using the process outlined in this work. The dataset and models are available at https://github.com/l3cube-pune/code-mixed-nlp.
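To make the masked language modelling objective mentioned above concrete, the following is a minimal sketch of the token-masking step in the style of the original BERT recipe (select about 15% of tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged). The abstract does not state the exact settings used for HingBERT, so the probabilities, the `mask_tokens` helper, the whitespace tokenisation, and the example sentence are all illustrative assumptions rather than the authors' implementation.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed mask symbol, as in the original BERT vocabulary


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) for one masked-LM training example.

    labels[i] holds the original token at positions the model must predict,
    and None everywhere else. Hypothetical helper, not from the paper.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels


# Made-up Romanised Hindi-English sentence, split on whitespace for simplicity.
sentence = "yaar ye movie bahut achhi thi but ending was sad".split()
vocab = sorted(set(sentence))
masked, labels = mask_tokens(sentence, vocab, rng=random.Random(0))
print(masked)
print(labels)
```

In practice the masking would operate on subword IDs produced by the model's tokenizer rather than on whitespace-separated words; plain strings are used here only to keep the sketch self-contained.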


