Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi

09/19/2023
by Md. Nishat Raihan, et al.

One of the most popular downstream tasks in Natural Language Processing is text classification, and it becomes considerably harder when the texts are code-mixed. Although they are not exposed to such text during pre-training, various BERT models have demonstrated success on code-mixed NLP tasks. Moreover, to enhance their performance, code-mixed NLP models have relied on combining synthetic data with real-world data. It is therefore crucial to understand how a BERT model's performance is affected when it is pre-trained on the corresponding code-mixed languages. In this paper, we introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model further fine-tuned on code-mixed data. Both models are evaluated across multiple NLP tasks and demonstrate competitive performance against larger models such as mBERT and XLM-R. Our two-tiered pre-training approach offers an efficient alternative for multilingual and code-mixed language understanding, contributing to advancements in the field.
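The abstract describes a two-tiered recipe: first pre-train a DistilBERT-style model on Bangla, English, and Hindi text, then continue pre-training on code-mixed data. The sketch below illustrates what such a pipeline could look like with the Hugging Face transformers and datasets libraries; the base checkpoint, corpus file names, and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: two-tiered masked-language-model (MLM) pre-training.
# Assumptions: the base checkpoint, file names, and hyperparameters below are
# illustrative placeholders, not the exact Tri-/Mixed-Distil-BERT setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "distilbert-base-multilingual-cased"  # assumed starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForMaskedLM.from_pretrained(BASE)


def tokenize(batch):
    # Truncate to a fixed length for simplicity.
    return tokenizer(batch["text"], truncation=True, max_length=128)


def continue_pretraining(text_file, output_dir):
    """Run one tier of MLM pre-training on a plain-text corpus."""
    dataset = load_dataset("text", data_files={"train": text_file})["train"]
    dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )
    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=16,
        num_train_epochs=1,  # illustrative; real runs train far longer
        learning_rate=5e-5,
        save_strategy="epoch",
    )
    Trainer(
        model=model, args=args, train_dataset=dataset, data_collator=collator
    ).train()
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)


# Tier 1: monolingual Bangla, English, and Hindi text (hypothetical file).
continue_pretraining("bn_en_hi_corpus.txt", "tri-distil-bert")
# Tier 2: code-mixed text; the same model object carries over the tier-1
# weights, so this continues from the first tier (hypothetical file).
continue_pretraining("code_mixed_corpus.txt", "mixed-distil-bert")
```

Because the Trainer updates the model in place, the second call starts from the tier-1 weights, mirroring the two-tiered idea of building the code-mixed model on top of the trilingual one.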
