A Twitter BERT Approach for Offensive Language Detection in Marathi

12/20/2022
by Tanmay Chavan, et al.

Automated offensive language detection is essential in combating the spread of hate speech, particularly on social media. This paper describes our work on offensive language identification in Marathi, a low-resource Indic language. The problem is formulated as a text classification task: each tweet is labeled as offensive or non-offensive. We evaluate several monolingual and multilingual BERT models on this task, focusing on BERT models pre-trained on social media data. We compare the performance of MuRIL, MahaTweetBERT, MahaTweetBERT-Hateful, and MahaBERT on the HASOC 2022 test set. We also explore data augmentation with two existing external Marathi hate speech corpora, HASOC 2021 and L3Cube-MahaHate. MahaTweetBERT, a BERT model pre-trained on Marathi tweets, outperforms all other models when fine-tuned on the combined dataset (HASOC 2021 + HASOC 2022 + MahaHate), reaching an F1 score of 98.43 on the HASOC 2022 test set. This is also a new state-of-the-art result on the HASOC 2022 / MOLD v2 test set.
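As a rough illustration of how an F1 score for this binary task can be computed, the sketch below averages per-class F1 over the two labels (macro F1). The abstract does not state which F1 variant is reported, so macro averaging is an assumption here, and the function names and label strings are hypothetical.

```python
from typing import Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 score treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(
    y_true: Sequence[str],
    y_pred: Sequence[str],
    labels: Sequence[str] = ("offensive", "non-offensive"),  # assumed label names
) -> float:
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    return sum(f1_for_label(y_true, y_pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    gold = ["offensive", "offensive", "non-offensive", "non-offensive"]
    pred = ["offensive", "non-offensive", "non-offensive", "non-offensive"]
    print(f"macro F1 = {macro_f1(gold, pred):.4f}")
```

Multiplying the result by 100 would put it on the same scale as the score quoted above.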

research
04/06/2019

UM-IU@LING at SemEval-2019 Task 6: Identifying Offensive Tweets Using BERT and SVMs

This paper describes the UM-IU@LING's system for the SemEval 2019 Task 6...
research
02/14/2022

Punctuation restoration in Swedish through fine-tuned KB-BERT

Presented here is a method for automatic punctuation restoration in Swed...
research
12/05/2020

Enhanced Offensive Language Detection Through Data Augmentation

Detecting offensive language on social media is an important task. The I...
research
02/21/2022

Items from Psychometric Tests as Training Data for Personality Profiling Models of Twitter Users

Machine-learned models for author profiling in social media often rely o...
research
11/05/2021

Sexism Identification in Tweets and Gabs using Deep Neural Networks

Through anonymisation and accessibility, social media platforms have fac...
research
06/16/2022

DIALOG-22 RuATD Generated Text Detection

Text Generation Models (TGMs) succeed in creating text that matches huma...
research
01/26/2023

A benchmark for toxic comment classification on Civil Comments dataset

Toxic comment detection on social media has proven to be essential for c...
