Mono vs Multilingual BERT for Hate Speech Detection and Text Classification: A Case Study in Marathi

04/19/2022
by Abhishek Velankar, et al.

Transformers are the most prominent architectures for a vast range of Natural Language Processing tasks. These models are pre-trained on large text corpora and deliver state-of-the-art results on tasks such as text classification. In this work, we conduct a comparative study between monolingual and multilingual BERT models. We focus on the Marathi language and evaluate the models on datasets for hate speech detection, sentiment analysis, and simple text classification in Marathi. We use standard multilingual models such as mBERT, IndicBERT, and XLM-RoBERTa and compare them with MahaBERT, MahaALBERT, and MahaRoBERTa, the monolingual models for Marathi. We further show that the Marathi monolingual models outperform the multilingual BERT variants on five different downstream fine-tuning experiments. We also evaluate sentence embeddings from these models by freezing the BERT encoder layers. We show that monolingual MahaBERT-based models provide richer representations than sentence embeddings from their multilingual counterparts. However, we observe that these embeddings are not generic enough and do not work well on out-of-domain social media datasets. We consider two Marathi hate speech datasets, L3Cube-MahaHate and HASOC-2021, a Marathi sentiment classification dataset, L3Cube-MahaSent, and the Marathi Headline and Articles classification datasets.
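The two evaluation settings described above (full fine-tuning of all encoder layers, and sentence embeddings from a frozen BERT encoder) can be illustrated concretely. The following is a minimal sketch, not the authors' released code, assuming the Hugging Face transformers API; the hub model IDs (l3cube-pune/marathi-bert for MahaBERT, bert-base-multilingual-cased for mBERT), the mean-pooling step, and the example sentence are assumptions for illustration only.

```python
# Minimal sketch of the two evaluation settings (assumed, not the paper's code):
#   1) full fine-tuning of a classifier head plus all encoder layers
#   2) frozen-encoder sentence embeddings with mean pooling
import torch
from transformers import (AutoTokenizer, AutoModel,
                          AutoModelForSequenceClassification)

MONOLINGUAL = "l3cube-pune/marathi-bert"       # MahaBERT (assumed hub ID)
MULTILINGUAL = "bert-base-multilingual-cased"  # mBERT

def build_classifier(model_name: str, num_labels: int):
    """Setting 1: all encoder layers stay trainable and are fine-tuned
    together with the classification head on the downstream dataset."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=num_labels)
    return tokenizer, model

def sentence_embeddings(model_name: str, sentences):
    """Setting 2: freeze the BERT encoder layers and extract fixed
    sentence representations by mean pooling over non-padding tokens."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)
    for p in encoder.parameters():   # freeze the encoder
        p.requires_grad = False
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch).last_hidden_state   # (batch, tokens, hidden)
    mask = batch["attention_mask"].unsqueeze(-1)   # ignore padding tokens
    return (out * mask).sum(1) / mask.sum(1)       # mean pooling

embs = sentence_embeddings(MONOLINGUAL, ["ही चित्रफीत छान आहे."])
print(embs.shape)  # e.g. torch.Size([1, 768]) for a BERT-base encoder
```

In this setup, comparing the monolingual and multilingual models only requires swapping the model ID; the downstream classifier or embedding pipeline stays identical.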

Related research

09/21/2022
SMTCE: A Social Media Text Classification Evaluation Benchmark and BERTology Models for Vietnamese
Text classification is a typical natural language processing or computat...

10/23/2021
Hate and Offensive Speech Detection in Hindi and Marathi
Sentiment analysis is the most basic NLP task to determine the polarity ...

11/21/2022
L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages
The monolingual Hindi BERT models currently available on the model hub d...

07/19/2020
Mono vs Multilingual Transformer-based Models: a Comparison across Several Language Tasks
BERT (Bidirectional Encoder Representations from Transformers) and ALBER...

08/31/2021
Monolingual versus Multilingual BERTology for Vietnamese Extractive Multi-Document Summarization
Recent researches have demonstrated that BERT shows potential in a wide ...

04/12/2022
Easy Adaptation to Mitigate Gender Bias in Multilingual Text Classification
Existing approaches to mitigate demographic biases evaluate on monolingu...

03/24/2021
Czert – Czech BERT-like Model for Language Representation
This paper describes the training process of the first Czech monolingual...
