Predicting the Type and Target of Offensive Social Media Posts in Marathi

11/22/2022
by Marcos Zampieri, et al.

Offensive language is widespread on social media, motivating platforms to invest in strategies that make communities safer. These strategies include developing robust machine learning systems capable of recognizing offensive content online. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English and a few other high-resource languages such as French, German, and Spanish. In this paper, we address this gap by tackling offensive language identification in Marathi, a low-resource Indo-Aryan language spoken in India. We introduce the Marathi Offensive Language Dataset v.2.0 (MOLD 2.0) and present multiple experiments on this dataset. MOLD 2.0 is a much larger version of MOLD, with annotation expanded to levels B (type) and C (target) of the popular OLID taxonomy. MOLD 2.0 is the first hierarchical offensive language dataset compiled for Marathi, thus opening new avenues for research in low-resource Indo-Aryan languages. Finally, we also introduce SeMOLD, a larger dataset annotated following the semi-supervised methods presented in SOLID.


